How could a platform-independent touch be achieved without using the actual touch executable? I cannot rely on touch being in PATH, or even existing, on a particular system.
Create an empty file, then copy the original file plus the empty file back over the original file; the concatenating copy updates the file's timestamp. I just tested it on Windows and it worked.
C:\tmp>dir \utils\emp*
 Volume in drive C has no label.
 Volume Serial Number is BC80-0D15

 Directory of C:\utils

2011-03-14  11:58                 0 empty_file
               1 File(s)              0 bytes
               0 Dir(s)  27,506,368,512 bytes free

C:\tmp>dir *.gif
 Volume in drive C has no label.
 Volume Serial Number is BC80-0D15

 Directory of C:\tmp

2010-10-08  12:00            20,463 cknight.gif
2009-10-30  17:31         1,298,525 img-big.gif
2009-10-30  17:46           225,992 img.gif
               3 File(s)      1,544,980 bytes
               0 Dir(s)  27,506,368,512 bytes free

C:\tmp>copy /b img.gif+\Utils\empty_file /b img.gif
img.gif
\Utils\empty_file
        1 file(s) copied.

C:\tmp>dir *.gif
 Volume in drive C has no label.
 Volume Serial Number is BC80-0D15

 Directory of C:\tmp

2010-10-08  12:00            20,463 cknight.gif
2009-10-30  17:31         1,298,525 img-big.gif
2011-03-14  12:07           225,992 img.gif
               3 File(s)      1,544,980 bytes
               0 Dir(s)  27,506,368,512 bytes free

C:\tmp>
touch is a pretty simple program. You could easily extract the essential steps and implement them as a subroutine in your system. See the source code.
touch depends on utime(), which is POSIX and should be available on most platforms.
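For example, a minimal sketch of such a subroutine in Python (the standard library wraps utime() portably; the helper name is mine):

import os

def touch(path):
    # Create the file if it does not exist yet
    with open(path, "a"):
        pass
    # Set access and modification times to "now"; os.utime() wraps utime()
    os.utime(path, None)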
Alternatively, you could just provide your own touch implementation as an executable (if you need to call it from a script).
Write a C language program to complete the following three tasks using the BIOSDISK function.
1. Suppose one removable disk is attached to your system. Check whether it is ready for access or not, and show an appropriate message in either case.
2. Read the drive parameters of the first removable disk of the system (the drive parameters will be returned in the buffer that is passed as a parameter). After reading, write the contents of the buffer to a file.
3. Format track number 1 and set the bad-sector flags (if bad sectors are present) of the first removable disk of your system. The remaining parameters should be as follows: head number = 0, sector number = 1, total number of sectors (nsects) = 1.
Use biosdisk(int cmd, int drive, int head, int track, int sector, int nsects, void *buffer);
Drive is a number that specifies which disk drive is to be used:
0 for the first floppy disk drive, 1 for the second floppy disk drive, 2 for the third, and so on.
For hard disk drives, a drive value of 0x80 specifies the first drive, 0x81 specifies the second, 0x82 the third, and so forth.
I'd like to create a zip file with a specific size, e.g. 10,000,000 bytes.
Let's say that I have a number of files to zip, and they sum up to 8,123,456 bytes (an arbitrary number) once zipped.
I'd like to add "rubbish" to reach exactly the size I chose (10 MB in this case).
Of course I'm assuming the files I have to zip do not exceed my limit.
I don't have any specific language requirement for this.
Just concatenate a "rubbish" file and a valid Zip file whose sizes together add up to your 10 MB.
The Zip format was made for such situations: typically, you concatenate an executable and a Zip file to make an installer or a self-extracting archive.
Here is an example (I concatenated a randomly chosen GIF and a Zip). The Info-ZIP unzip tool says:
unzip -v appended.zip
Archive:  appended.zip
warning [appended.zip]:  1772887 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length  Method     Size  Ratio    Date    Time    CRC-32   Name
--------  ------  -------  -----    ----    ----    ------   ----
     186  Defl:N      140    25%  15.02.12  12:03  56202537  zip-ada/debg_za.cmd
     349  Defl:N      202    42%  10.02.12  22:19  7718ccec  zip-ada/debug.pra
    4357  Defl:N     1381    68%  24.09.16  06:43  30f2fef0  zip-ada/demo/demo_csv_into_zip.adb
    1015  Defl:N      513    50%  02.10.18  15:49  f0edcf97  zip-ada/demo/demo_unzip.adb
     603  Defl:N      310    49%  20.03.16  08:26  b3906614  zip-ada/demo/demo_zip.adb
  161483  Defl:N    39845    75%  27.08.16  16:35  9f24d1fe  zip-ada/doc/appnote.txt
...
The decompression test is successful (as expected):
unzip -t appended.zip
Archive: appended.zip
warning [appended.zip]: 1772887 extra bytes at beginning or within zipfile
(attempting to process anyway)
testing: zip-ada/debg_za.cmd OK
testing: zip-ada/debug.pra OK
testing: zip-ada/demo/demo_csv_into_zip.adb OK
testing: zip-ada/demo/demo_unzip.adb OK
testing: zip-ada/demo/demo_zip.adb OK
testing: zip-ada/doc/appnote.txt OK
testing: zip-ada/doc/lzma-specification.txt OK
...
No errors detected in compressed data of appended.zip.
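A rough Python sketch of this padding-in-front approach (the file names and the zero filler are just for illustration):

import os

def pad_in_front(zip_path, target_size, out_path):
    pad = target_size - os.path.getsize(zip_path)
    if pad < 0:
        raise ValueError("archive is already larger than the target size")
    with open(out_path, "wb") as out, open(zip_path, "rb") as zf:
        out.write(b"\0" * pad)   # the "rubbish" prefix
        out.write(zf.read())     # the intact Zip archive
    # unzip reports "extra bytes at beginning" (as above) but still extracts.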
Wrap the zipped data in another file formatted as:
<zip file length><zip file content><random data until the size is as required>
You know how much random data to add because it is the required size minus the length of the zip file minus the size of the length field itself.
To unzip, just reverse the process.
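A possible Python sketch of that wrapper format (the 8-byte little-endian length prefix and the function names are arbitrary choices for illustration):

import os, struct

def wrap(zip_path, target_size, out_path):
    data = open(zip_path, "rb").read()
    header = struct.pack("<Q", len(data))           # <zip file length>
    filler = target_size - len(header) - len(data)  # random data to append
    if filler < 0:
        raise ValueError("target size too small")
    with open(out_path, "wb") as out:
        out.write(header + data + os.urandom(filler))

def unwrap(wrapped_path, out_path):
    with open(wrapped_path, "rb") as f:
        (length,) = struct.unpack("<Q", f.read(8))
        open(out_path, "wb").write(f.read(length))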
You can add your "rubbish" file without compression (stored) within your target Zip file.
The added size will be the "rubbish" file's size, plus 76 bytes (a 30-byte local header and a 46-byte central directory entry), plus twice the length of the "rubbish" file's name. Some zipping tools add extra metadata on top of that.
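A sketch of this approach with Python's zipfile module, assuming a plain (non-Zip64) archive and a tool that adds no extra metadata; the padding entry's name is arbitrary:

import os, zipfile

def pad_inside(zip_path, target_size, pad_name="rubbish.bin"):
    # 30-byte local header + 46-byte central directory entry = 76 bytes,
    # plus the entry name stored once in each, as described above.
    overhead = 76 + 2 * len(pad_name)
    pad = target_size - os.path.getsize(zip_path) - overhead
    if pad < 0:
        raise ValueError("target size too small")
    with zipfile.ZipFile(zip_path, "a", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(pad_name, b"\0" * pad)   # stored, i.e. not compressed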
I am using Spark 2.3.1 with PySpark (on AWS EMR).
I am getting memory errors:
Container killed by YARN for exceeding memory limits
Consider boosting spark.yarn.executor.memoryOverhead
I have 160 input files, each approximately 350-400 MB, and each is a gzipped CSV.
To read the csv.gz files (with a wildcard) I use this PySpark:
dfgz = spark.read.load("s3://mybucket/yyyymm=201708/datafile_*.csv.gz",
    format="csv", sep="^", inferSchema="false", header="false",
    multiLine="true", quote="^", nullValue="~", schema="id string,....")
To save the data frame I use this (PySpark)
(dfgz
    .write
    .partitionBy("yyyymm")
    .mode("overwrite")
    .format("parquet")
    .option("path", "s3://mybucket/mytable_parquet")
    .saveAsTable("data_test.mytable")
)
One line of code to save all 160 files.
I tried this with 1 file and it works fine.
Total size for all 160 files (csv.gz) is about 64 GB.
Each file, as a pure CSV once unzipped, is approximately 3.5 GB. I am assuming Spark may unzip each file in RAM and then convert it to Parquet in RAM?
I want to convert each csv.gz file to Parquet format, i.e. I want 160 Parquet files as output (ideally).
The task runs for a while and seems to create one Parquet file for each csv.gz file, but after some time it always fails with a YARN memory error.
I tried various settings for executor memory and memoryOverhead, all with no change: the job always fails. I tried memoryOverhead of up to 1-8 GB and executor memory of 8 GB.
Apart from manually breaking up the 160-file input workload into many small workloads (see the sketch below), what else can I do?
Do I need a Spark cluster with a total RAM capacity much greater than 64 GB?
I use 4 slave nodes, each with 8 CPUs and 16 GB of RAM, plus one master with 4 CPUs and 8 GB of RAM.
This is (even with overhead) less than the 64 GB of gzipped CSV input I am trying to process, but the files are evenly sized at 350-400 MB, so I don't understand why Spark is throwing memory errors: it could easily process them one file at a time per executor, discard each, and move on to the next file. It does not appear to work this way. I feel it is trying to load all the input csv.gz files into memory, but I have no way of knowing (I am still new to Spark 2.3.1).
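For what it's worth, "manually breaking up the workload" can be approximated by looping over the files and converting them one at a time. A rough sketch reusing the read options from above; the paths list and per-file output layout are made up, and the schema string is left elided as in the question:

# Assume `paths` holds the 160 csv.gz object keys, gathered beforehand
# (e.g. with `aws s3 ls` or boto3); these names are not from the question.
for i, p in enumerate(paths):
    df = (spark.read
            .load(p, format="csv", sep="^", inferSchema="false", header="false",
                  multiLine="true", quote="^", nullValue="~",
                  schema="id string,...."))    # same (elided) schema as above
    (df.write
        .mode("overwrite")
        .parquet("s3://mybucket/mytable_parquet/yyyymm=201708/file_%03d" % i))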
Late Update: I managed to get it to work with the following memory config:
4 slave nodes, each 8 CPU and 16 GB of RAM
1 master node, 4 CPU and 8 GB of RAM:
spark maximizeResourceAllocation false
spark-defaults spark.driver.memoryOverhead 1g
spark-defaults spark.executor.memoryOverhead 2g
spark-defaults spark.executor.instances 8
spark-defaults spark.executor.cores 3
spark-defaults spark.default.parallelism 48
spark-defaults spark.driver.memory 6g
spark-defaults spark.executor.memory 6g
Needless to say - I cannot explain why this config worked!
Also, this took over 2 hours to process 64 GB of gzipped data, which seems slow even for a small 4+1 node cluster with a total of 32+4 CPUs and 64+8 GB of RAM. Perhaps S3 was the bottleneck.
FWIW I just did not expect to micro-manage a database cluster for memory, disk I/O or CPU allocation.
Update 2:
I just ran another load on the same cluster with the same config, a smaller load of 129 files of the same sizes, and it failed with the same YARN memory errors.
I am very disappointed with Spark 2.3.1 memory management.
Thank you for any guidance
I would like to use xattr to store some metadata on my files, directly on the files. These are essentially tags that I use for categorization when I search the files. My goal is to extend the usual Mac OS X tags by associating more info with each tag, for instance the date the tag was added, and maybe other things.
I was thinking of adding an xattr to the files, using xattr -w. My first guess would be to store something like JSON in this xattr value, but I was wondering:
1) what are the limits on the size I can store in an xattr? (the man page of xattr is vague and refers to something called _PC_XATTR_SIZE_BITS, which I cannot locate anywhere)
2) anything wrong with storing a JSON formatted string as an xattr?
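For context, writing and reading such a JSON payload through the xattr tool could look like this (a Python sketch; the attribute name com.example.taginfo and the payload fields are made up):

import json, subprocess

ATTR = "com.example.taginfo"   # hypothetical attribute name

def write_tag_info(path, tag, added):
    payload = json.dumps({"tag": tag, "added": added})
    subprocess.run(["xattr", "-w", ATTR, payload, path], check=True)

def read_tag_info(path):
    out = subprocess.run(["xattr", "-p", ATTR, path],
                         check=True, capture_output=True, text=True)
    return json.loads(out.stdout)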
According to man pathconf, there is a “configurable system limit or option variable” called _PC_XATTR_SIZE_BITS which is
    the number of bits used to store maximum extended attribute size in bytes. For example, if the maximum attribute size supported by a file system is 128K, the value returned will be 18. However a value 18 can mean that the maximum attribute size can be anywhere from (256KB - 1) to 128KB. As a special case, the resource fork can have much larger size, and some file system specific extended attributes can have smaller and preset size; for example, Finder Info is always 32 bytes.
You can determine the value of this parameter using this small command line tool written in Swift 4:
import Foundation

let args = CommandLine.arguments.dropFirst()
guard let pathArg = args.first else {
    print("File path argument missing!")
    exit(EXIT_FAILURE)
}
let v = pathconf(pathArg, _PC_XATTR_SIZE_BITS)
print("_PC_XATTR_SIZE_BITS: \(v)")
exit(EXIT_SUCCESS)
I get:
31 bits for HFS+ on OS X 10.11
64 bits for APFS on macOS 10.13
as the number of bits used to store maximum extended attribute size. These imply that the actual maximum xattr sizes are somewhere in the ranges
1 GiB ≤ maximum < 2 GiB for HFS+ on OS X 10.11
8 EiB ≤ maximum < 16 EiB for APFS on macOS 10.13
I seem to be able to write at least 260 kB, like this, by generating 260 kB of nulls and converting them to the letter "a" so I can see them:
xattr -w myattr "$(dd if=/dev/zero bs=260000 count=1|tr '\0' a)" fred
1+0 records in
1+0 records out
260000 bytes transferred in 0.010303 secs (25235318 bytes/sec)
And then read them back with:
xattr -l fred
myattr: aaaaaaaaaaaaaaaaaa...aaa
And check the length returned:
xattr -l fred | wc -c
260009
I suspect this is actually a limit of ARG_MAX on the command line:
sysctl kern.argmax
kern.argmax: 262144
Also, just because you can store 260 kB in an xattr, that does not mean it is advisable. I don't know about HFS+, but on some Unixy filesystems the attributes can be stored directly in the inode, and if you go over a certain limit, additional space has to be allocated on disk for the data.
With the advent of High Sierra and APFS replacing HFS+, be sure to test on both filesystems. Also make sure that Time Machine backs up and restores the data, and that utilities such as ditto, tar and the Finder propagate the attributes when copying/moving/archiving files.
Also consider what happens when you email a tagged file, or copy it to a FAT-formatted USB memory stick.
I also tried setting multiple attributes on a single file, and the following script successfully wrote 1,000 attributes (called attr-1, attr-2 ... attr-1000), each of 260 kB, to a single file - meaning that the file effectively carries 260 MB of attributes:
#!/bin/bash
for ((a=1; a<=1000; a++)); do
    echo Setting attr-$a
    xattr -w attr-$a "$(dd if=/dev/zero bs=260000 count=1 2> /dev/null | tr '\0' a)" fred
    if [ $? -ne 0 ]; then
        echo ERROR: Failed to set attr
        exit
    fi
done
These can all be seen and read back too - I checked.
I am trying to simulate data loss in a video by selectively removing H.264 bitstream data. The data is simply a raw H.264 file, which is essentially a binary file. My plan is to delete 2 bytes for every 100 bytes so as to achieve a 2% loss. Eventually, I will be testing the effectiveness of some motion vector error concealment algorithms.
It would be nice to be able to do this in a Unix environment. So far, I have investigated the command xxd for a bit and I am able to save a specific portion of a hex dump from a binary file. For example, to skip the first 50 bytes of a binary bitstream and save the subsequent 100 bytes, I would do the following:
xxd -s 50 -l 100 inputBinaryFile | xxd -r > outputBinaryFile
I'm hoping to incorporate something similar into a bash script that will automatically delete the last 2 bytes per 100 bytes. Furthermore, I would like the script to skip everything before the second occurrence of the sequence 00 00 01 06 05 (first P-frame SEI start code).
I don't know how much easier this could be in a C-based language but my programming skills are quite limited and I would rather deal with just Linux programming for now if possible.
Thanks.
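A short Python sketch of what the question describes: skip past the second occurrence of the start-code sequence, then keep only the first 98 bytes of every 100-byte block of the remainder (the file names are placeholders):

MARKER = bytes.fromhex("0000010605")   # the 00 00 01 06 05 sequence from the question

def drop_bytes(in_path, out_path, block=100, keep=98):
    data = open(in_path, "rb").read()
    first = data.find(MARKER)
    second = data.find(MARKER, first + 1) if first != -1 else -1
    start = second if second != -1 else 0        # if not found, corrupt from the start
    head, tail = data[:start], data[start:]      # head is left untouched
    # Keep the first 98 bytes of every 100-byte block: roughly a 2% loss
    kept = b"".join(tail[i:i + keep] for i in range(0, len(tail), block))
    open(out_path, "wb").write(head + kept)

drop_bytes("inputBinaryFile", "outputBinaryFile")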