SAP HANA getting CSV data size

I am working with SAP HANA and I would like to know whether it is possible to get the disk size of CSV data with SQL.
The CSV data I mean is the file \index\SCHEMA NAME\CL\TABLE_NAME\data.csv that is produced after an export.
Best Regards
Houssem

No, there is no way to get this directly.
You could, however, do a rough estimation by looking at M_CS_COLUMNS to see the estimated uncompressed size of each column.
Then add roughly six bytes (double-byte encoding) per column per record to account for the enclosing quotation marks and the separators between columns.
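As a rough illustration of that estimate, here is a small Python sketch using SAP's hdbcli driver; the view columns it queries (UNCOMPRESSED_SIZE in M_CS_COLUMNS, RECORD_COUNT in M_TABLES) and the connection parameters are assumptions you would want to verify against your HANA revision's documentation:
from hdbcli import dbapi  # SAP HANA Python client

def estimate_csv_bytes(schema, table, **conn_args):
    """Very rough CSV export size: uncompressed column data plus ~6 bytes
    per column per record for quotation marks and separators."""
    conn = dbapi.connect(**conn_args)  # address=, port=, user=, password=
    try:
        cur = conn.cursor()
        # Assumed monitoring-view columns - check your revision's documentation.
        cur.execute(
            "SELECT SUM(UNCOMPRESSED_SIZE), COUNT(DISTINCT COLUMN_NAME) "
            "FROM M_CS_COLUMNS WHERE SCHEMA_NAME = ? AND TABLE_NAME = ?",
            (schema, table))
        uncompressed, num_columns = cur.fetchone()
        cur.execute(
            "SELECT RECORD_COUNT FROM M_TABLES "
            "WHERE SCHEMA_NAME = ? AND TABLE_NAME = ?",
            (schema, table))
        num_rows, = cur.fetchone()
        return (uncompressed or 0) + 6 * (num_columns or 0) * (num_rows or 0)
    finally:
        conn.close()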

Related

How to store a row index for very large (csv) files?

I am working with large CSV files (>> 10^6 lines) and need a row index for some operations. I need to compare two versions of the files and identify deletions, so I thought it would be easiest to include a row index. I guess that the number of lines would quickly render traditional integers inefficient. I'm averse to the idea of having a column containing, let's say, 634567775577 in plain text as a row index (followed by the actual data row). Are there any best-practice suggestions for this scenario?
The resulting files have to remain plain text, so serialisation / sqlite is not an option.
At the moment, I'm considering an index based on the actual row data (for example, concatenating the row data and converting it to base64 or the like), but would that be more reasonable than a plain integer? There should be no duplicate rows within each file, so I guess this could be one way.
Cheers, Sacha
Ps: I heavily modified the initial question for clarification
You can use regular numbers.
Python is not afraid of large numbers :) (well, up to the order of magnitude you described...).
Just open a Python shell and type 10**999 and see that it doesn't overflow or anything.
In Python, there's no actual bit limit for integers. In Python 2 there technically is - an int is 32 bits and a long is more than 32 bits - but if you just write a large number, the conversion happens implicitly.
Python 3 has just one integer type, and it only cares about memory.
So there's no real reason why you can't use an integer if you really want to add an index.
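If you do decide on plain integers, adding them while copying the file is trivial; a minimal sketch (file names are placeholders):
import csv

with open("data.csv", newline="") as src, \
     open("indexed.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for i, row in enumerate(csv.reader(src)):
        writer.writerow([i] + row)  # prepend the row index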
The Python standard library includes SQLite, a self-contained, single-file DBMS which, contrary to common perception, can be quite performant. If the records are consulted by a single application with no concurrency, it compares well with specialized DBMSs that require a separate daemon.
So, essentially, you can dump your CSV into a SQLite database and create the indices you need - even on all four columns if necessary.
Here is a template script you could customize to create such a DB -
I guessed "1000" for the number of inserts per batch; it may not be optimal - try tweaking it if inserting is too slow.
import sqlite3
import csv

inserts_at_time = 1000

def create_and_populate_db(dbfilename, csvfilename):
    db = sqlite3.connect(dbfilename)
    db.execute("""CREATE TABLE data (col1, col2, col3, col4)""")
    for col_name in "col1 col2 col3 col4".split():
        db.execute(f"""CREATE INDEX {col_name} ON data ({col_name})""")
    with open(csvfilename, newline="") as in_file:
        reader = csv.reader(in_file)
        next(reader)  # skip the header row
        total = 0
        while True:
            # Pull up to inserts_at_time rows from the reader into one batch.
            lines = [line for _, line in zip(range(inserts_at_time), reader)]
            if not lines:
                break
            db.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", lines)
            total += len(lines)
            print(f"Inserted {len(lines)} lines - total {total}")
            if len(lines) < inserts_at_time:
                break
    db.commit()
    db.close()
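Once the database exists, lookups go through the indices instead of re-scanning the CSV. A possible usage sketch (file names and the lookup value are just placeholders):
create_and_populate_db("rows.db", "rows.csv")
db = sqlite3.connect("rows.db")
# SQLite's implicit rowid also gives you a row index for free.
hits = db.execute("SELECT rowid, * FROM data WHERE col1 = ?",
                  ("some value",)).fetchall()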

Bulk export of binary waveform data from oscilloscope to data points (csv preferred)

I'm working with some binary waveform files from various early-to-mid-1990s HP scopes. I am trying to do a bulk conversion (we have over 5000) of the files to CSVs and then upload them into a database. I've tried hexdump, xxd, od, strings, etc., and none of them seem to work. I did hunt down a programmer's manual but it's not making a whole lot of sense.
The files have a preamble line as ASCII text, but the data points are in binary and, for some reason, nothing I try can decode them. The preamble gives the data necessary to interpret the binary values and calculate the correct measurements. It also states that the data is in WORD format.
:WAV:PRE 2,1,32768,1,+4.000000E-08,-4.9722700001108E-06,0,+2.460630E-04,+2.500000E+00,16384;:WAV:DATA #800065536^W�^W�^W�^
I'm pretty confused.
Have a look at
http://www.naic.edu/~phil/hardware/oscilloscopes/9000A_Programmer_Reference.pdf
specifically page 1-21. After ":WAV:DATA", I think the rest of the chunk above will have 65536 8-bit data bytes (the start of which is represented above by �). The ^W is probably a delimiter, so you would have to parse that out. Just a thought.
UPDATE: I'm new to oscilloscope data collection and am trying to figure the whole thing out from scratch. So, on further digging, it looks like the data you have provided shows this:
PREamble:
- WORD format (16-bit signed integers split into 2 8-bit bytes)
- If there is a WAV:BYT section, that would specify byte order for each pair
- RAW data
- 32768 data points
- COUNT = 1 (I'm not clear on the meaning of this)
- Next 3 should be X increment, origin, reference
- Next 3 should be Y increment, origin, reference, although the manual that I pointed you at above has many more fields than just these, so you might want to consult your specific scope manual.
DATA:
- On closer examination, I don't think the ^W is a delimiter; I think it is the first byte of a pair (00010111). The � character is apparently a standard "I don't know how to represent this character" web rendering. You would need to look at that character as 8 bits as well.
- 65536 byte pairs of data
I'm not finding a utility that will do this for you. I think you're going to have to write or acquire some code (Perl, C, Java, Python, VB, etc.) to get this done.
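A rough sketch in Python of that decoding; the preamble field order (x/y increment, origin, reference) and the big-endian byte order are assumptions taken from the discussion above, so verify them against your scope's programmer manual (and flip ">" to "<" if the trace comes out wrong):
import struct

def decode_waveform(path):
    raw = open(path, "rb").read()
    head, _, block = raw.partition(b":WAV:DATA")
    # Preamble fields: format, type, points, count, xinc, xorig, xref, yinc, yorig, yref
    fields = head.split(b":WAV:PRE")[1].split(b";")[0].decode().split(",")
    xinc, xorig = float(fields[4]), float(fields[5])
    yinc, yorig, yref = float(fields[7]), float(fields[8]), float(fields[9])
    # IEEE 488.2 definite-length block: '#', one digit giving the width of the
    # length field, then the byte count, e.g. "#800065536" -> 65536 bytes.
    block = block[block.index(b"#"):]
    ndigits = int(block[1:2])
    nbytes = int(block[2:2 + ndigits])
    data = block[2 + ndigits:2 + ndigits + nbytes]
    # WORD format: 16-bit signed integers, two bytes per sample.
    words = struct.unpack(f">{nbytes // 2}h", data)
    return [(xorig + i * xinc, yorig + (w - yref) * yinc)
            for i, w in enumerate(words)]
From there, csv.writer(...).writerows(decode_waveform(path)) over the 5000 files gets you the CSVs to load into the database.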

Searching through very large rainbow table file

I am looking for the best way to search through a very large rainbow table file (13GB file). It is a CSV-style file, looking something like this:
1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring
... you get the idea - there are about 900 million lines, always with a hash, a semicolon, and a clear-text string.
So basically, the program should check whether a specific hash is listed in this file.
What's the fastest way to do this?
Obviously, I can't read the entire file into memory and then put a strstr() on it.
So what's the most efficient way to do this?
Read the file line by line, always calling strstr()?
Read larger chunks of the file (e.g. 10,000 lines) and do a strstr()?
Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?
Any help is appreciated.
The best way to do it would be to sort it and then use a binary search-like algorithm on it. After sorting it, it will take around O(log n) time to find a particular entry where n is the number of entries you have. Your algorithm might look like this:
1. Keep a start offset and an end offset. Initialize the start offset to zero and the end offset to the file size.
2. If start = end, there is no match.
3. Read some data from the offset (start + end) / 2.
4. Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size (bigger than most of your records) to read in step 3, you probably won't have to read any more.)
5. If the hash you're on is the hash you're looking for, go on to step 8.
6. Otherwise, if the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2.
7. If the hash you're on is greater than the hash you're looking for, set end to the current position and go to step 2.
8. Skip past the semicolon and the trailing space. The unhashed data runs from the current position to the next newline.
This can be easily converted into a while loop with breaks.
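A minimal sketch of this in Python, assuming the file has already been sorted by hash (e.g. with the Unix sort utility) and every line looks like "hash; string" with hashes in a consistent case, so that plain string comparison matches the sort order:
def lookup(path, target_hash):
    with open(path, "rb") as f:
        f.seek(0, 2)                        # seek to the end to get the file size
        start, end = 0, f.tell()
        while start < end:                  # step 2: start = end means no match
            mid = (start + end) // 2
            f.seek(mid)                     # step 3
            if mid > 0:
                f.readline()                # step 4: skip the partial line we landed in
            line = f.readline()
            if not line:                    # we landed inside the very last line
                end = mid
                continue
            found, _, clear = line.partition(b";")
            found = found.strip().decode()
            if found == target_hash:        # steps 5 and 8
                return clear.strip().decode()
            elif found < target_hash:       # step 6
                start = mid + 1
            else:                           # step 7
                end = mid
    return None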
Importing it into MySQL with appropriate indices would use a similarly efficient algorithm (or a better one, since the data is probably packed more tightly).
Your last solution might be the easiest one to implement, as you move the whole performance optimization into the database (and databases are usually optimized for exactly that).
strstr is not useful here because it searches an arbitrary string, but you know the specific format and can jump and compare in a more targeted way. Think about strncmp and strchr.
The overhead of reading a single line at a time would be really high (as is often the case with file I/O). So I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
You can also think about using memory-mapped I/O instead of the standard C file API. This lets you leave the loading of the contents to the operating system, so you don't have to care about caching yourself.
Of course, restructuring the data for faster access would help too. For example, insert padding bytes so all records are equally long. This gives you "random" access to your data stream, because you can easily calculate the position of the nth entry.
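A small sketch of that padding idea: if every record is padded to a fixed length, the nth entry starts at byte n * RECORD_LEN, and a memory-mapped file gives you cheap random access (the record layout and file name below are assumptions):
import mmap

RECORD_LEN = 64  # e.g. 32 hex digits + "; " + string padded to 29 chars + "\n"

with open("table_padded.csv", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    n = 123456
    record = mm[n * RECORD_LEN:(n + 1) * RECORD_LEN]  # nth entry, no scanning needed
    hash_part, _, clear = record.partition(b";")
    print(hash_part.decode(), clear.strip().decode())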
I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file was 12 GiB then each of the smaller files would be (on average) 208 KiB.
Next, separate the hash from the string; such that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed anymore) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files at an average of 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.
Next, the hash files should be in a binary format. 12 hexadecimal digits cost 48 bits (not 12*8=96 bits). This alone would halve the size of the hash files. If the strings are aligned on a 4-byte boundary in the strings file, then a 16-bit "offset of the string / 4" would be fine (as long as the string file is less than 256 KiB). Entries in the hash file should be sorted in order, and the corresponding strings file should be in the same order.
After all these changes, you'd use the highest 16 bits of the hash to find the right hash file, load the hash file and do a binary search. Then (if found) you'd get the offset of the start of the string (in the strings file) from the entry in the hash file, plus the offset of the next string from the next entry in the hash file. Then you'd load data from the strings file, starting at the start of the correct string and ending at the start of the next string.
Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, then that'd be enough to cache half of the hash files. In this case (half the hash files cached) you'd expect that half the time the only thing you'd need to load from disk is the string itself (e.g. probably less than 20 bytes) and the other half the time you'd need to load the hash file into the cache first (e.g. 60 KiB); so on average for each lookup you'd be loading about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM you can cache all of the hash files and start thinking about caching some of the strings.
A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For an example; if all your strings use lower case ASCII letters and are a max. of 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach - e.g. use a reversible encoding (with bit 64 of the integer/hash clear) where possible; and only use some sort of lookup (with bit 64 of the integer/hash set) for strings that can't be encoded in a reversible way. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings) this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
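To illustrate the reversible encoding, here is a sketch for lower-case ASCII strings of up to 13 characters; it uses base 27 (reserving 0 for "no character") rather than base 26 so that, for example, "a" and "aa" encode differently, and 27^13 still fits comfortably in 63 bits:
def encode(s):
    assert len(s) <= 13 and s.isalpha() and s.islower()
    n = 0
    for ch in s:
        n = n * 27 + (ord(ch) - ord("a") + 1)  # map 'a'..'z' to 1..26
    return n

def decode(n):
    chars = []
    while n:
        n, digit = divmod(n, 27)
        chars.append(chr(digit - 1 + ord("a")))
    return "".join(reversed(chars))

assert decode(encode("somestring")) == "somestring"
Strings that fail the assert would fall back to the lookup path, flagged by the spare top bit as described above.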

Mathematica import large integers from .csv?

I'm importing some data from a CSV into Mathematica. The first few lines of the CSV look like this:
"a_use","tstart","tend"
"bind items on truck to prevent from flying off",1328661514469,1328661531032
"hang laundry on",1328661531035,1328661541700
"tie firewood with",1328661541702,1328661554940
"anchor tent",1328661554942,1328661559797
Mathematica handles this almost perfectly:
data = Import["mystuff.csv"]
The problem is that those big timestamps get converted into scientific notation, and the precision is lost:
In[283]:= data[[2,2]]
Out[283]= 1.32866*10^12
As you can see, even though 1328661531035 is not the same as 1328661541700, the imported data is no longer precise enough to tell the two apart, since both get imported as 1.32866*10^12. I know Mathematica can handle integers of arbitrary length, so how can I get it to import these numbers as (large) integers instead of converting them into this lossy scientific notation?
What version are you using? No problem on Mma 8.0.1.
If you are creating the CSV file in Excel set the format of the timestamps to Number with zero decimal places (via More Number Formats...)

Is there a fast way of adding or removing content in the middle of a very large file

Say I have a very large file (say > 1GB) and I want to add a single character in the middle of it. Is it possible to do this without reading and writing the whole file out? My current solution is this (in pseudocode):
x = 0
chunk = read 4KB chunk x of input file
if chunkToEdit = x, chunk = addCharacter(chunk)
append chunk to the output file
x = x + 1
repeat last 4 steps until input file is fully read
delete input file
move output file to input file
While that works, it results in 1GB of reading, and 1GB of writing to make a single character change. It also requires a spare 1GB of disk space. What I would rather do is modify the part of the file that needs to be changed in place, so I only have to read and write one part of the file (ie 4KB of reading, and 4KB of writing). Is this possible (or a solution better than my one)?
I thought a solution might be possible if the OS could fragment the file and create a new fragment for the changed section, but I don't know whether such a capability has been implemented and exposed to developers.
No. Files don't work like that. If you need to change the size of the file then you need to operate from the modification point to the end.
Unless you're using a file format that can handle insertions/deletions cleanly, but it sounds like you aren't.
Adding a single character in the middle necessarily requires shifting everything after this one character by one character. This necessarily requires that you read and write everything from the point of insertion to the end of the file. A way that uses as little memory as possible to do so would be:
i = 1
read the i-th chunk of n bytes, counting backwards from the end of the file
write that chunk back shifted forward by 1 character
i = i + 1
repeat the last 3 steps until reaching the point of insertion
write the single character at the point of insertion
In other words: shift everything in chunks of n bytes by one character, starting from the end and working backwards through the file to the point of insertion, then insert the character. The closer the insertion point is to the end of the file, the faster this will be. If you often want to insert near the beginning of the file, this may not be the best solution.
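A rough sketch of that backward shift in Python, assuming the file can be grown simply by writing past its current end (r+b mode); the chunk size is arbitrary:
import os

def insert_byte(path, offset, byte, chunk_size=4096):
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        pos = size
        while pos > offset:                    # walk backwards from the end
            chunk_start = max(offset, pos - chunk_size)
            f.seek(chunk_start)
            chunk = f.read(pos - chunk_start)
            f.seek(chunk_start + 1)            # write it back shifted by one byte
            f.write(chunk)
            pos = chunk_start
        f.seek(offset)
        f.write(byte)                          # the inserted character, e.g. b"X"
This still reads and rewrites everything after the insertion point, exactly as described above; what it saves is the second copy of the file and the spare 1 GB of disk space.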