Equalize number of columns in csv data [closed]

I'm parsing some contact data from a PDF file converted to CSV, which has resulted in a different column count per row because of missing entries.
Is there a way to correct this using sed, awk, cut, etc., by pattern-matching the columns that are easy to identify, e.g. making sure the email addresses end up in the same column when available, and likewise values like "Lifetime member" or "Guest" when no email is available?
The first column is a person's/company's name, but the rest is arbitrary. The point is to extract contact information (like email, phone number, etc.) and put it in the same columns, when available.
My idea would be to check whether the email is in the 6th column and, if not, add a number of empty columns before it, etc.
Example data:
Steve Smith;9828;1;+1234 567 2345;Guest;steve@example.org;1;1 12th st;48572 Nowhere
Steve Jobs;+1234 567 2345;noreply@example.org;1;48572 Nowhere
John Smith;9828;1;+1234 567 2345;Lifetime member;1;1 23rd st;48572 Nowhere
Peter Blavounius;2312;peter@blavounius.com
Wanted output:
Steve Smith;9828;1;+1234 567 2345;Guest;steve@example.org;1;1 12th st;48572 Nowhere
Steve Jobs;+1234 567 2345;;;;noreply@example.org;1;;48572 Nowhere
John Smith;9828;1;+1234 567 2345;Lifetime member;1;1 23rd st;48572 Nowhere
Peter Blavounius;2312;;;;peter@blavounius.com
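A minimal sketch of that padding idea in Python (the target column index and the email-only anchor are assumptions for illustration; other fields, such as the street address, would need similar anchor rules):

```python
# Minimal sketch of the padding idea: shift the row so the e-mail address
# lands in column 6 (index 5). The column number is an assumption taken
# from the widest rows in the sample data.
TARGET = 5

def normalize(line: str) -> str:
    fields = line.split(";")
    for i, f in enumerate(fields):
        if "@" in f:
            if i < TARGET:
                fields[i:i] = [""] * (TARGET - i)  # pad before the e-mail
            break
    return ";".join(fields)

print(normalize("Peter Blavounius;2312;peter@example.org"))
# -> "Peter Blavounius;2312;;;;peter@example.org"
```

Rows without an email pass through unchanged; a full solution would add one such rule per recognizable field, which is exactly what the awk approach below generalizes.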

This will get you started, but it is not complete; you still need to identify the other fields. All I've done so far is identify a couple of them to show you the approach:
$ cat tst.awk
BEGIN {
    FS = OFS = ";"
    # append the input file name again so the file is read twice
    ARGV[ARGC] = ARGV[ARGC-1]
    ARGC++
}
{
    name = tel = email = digs4 = ""
    for (i=1; i<=NF; i++) {
        if (i == 1) {
            name = $i; $i = ""; nameFld = 1
        }
        else if ($i ~ /^\+/) {
            tel = $i; $i = ""; telFld = (i > telFld ? i : telFld)
        }
        else if ($i ~ /@/) {
            email = $i; $i = ""; emailFld = (i > emailFld ? i : emailFld)
        }
        else if ($i ~ /^[0-9]{4}$/) {
            digs4 = $i; $i = ""; digs4Fld = (i > digs4Fld ? i : digs4Fld)
        }
    }
    maxFlds = (NF > maxFlds ? NF : maxFlds)
}
NR > FNR {
    for (i=1; i<=maxFlds; i++) {
        if      (i == nameFld)  { $i = name }
        else if (i == telFld)   { $i = tel }
        else if (i == emailFld) { $i = email }
        else if (i == digs4Fld) { $i = digs4 }
        else                    { $i = $i }  # make sure null fields are present
    }
    print
}
$ awk -f tst.awk file
Steve Smith;9828;1;+1234 567 2345;Guest;steve@example.org;1;1 12th st;48572 Nowhere
Steve Jobs;;;+1234 567 2345;48572 Nowhere;noreply@example.org;;;
John Smith;9828;1;+1234 567 2345;Lifetime member;;1 23rd st;48572 Nowhere;
Peter Blavounius;2312;;;;peter@blavounius.com;;;
It does 2 passes on your input file: the first to identify the largest field number that matches each regexp (as that's where you want every field matching that regexp to appear in the output), and the second to identify the fields, clear out their locations in the record, and then place every field in the right location.
You could identify what a field means by matching its context to a regexp like above, by its fixed position in the line (e.g. the person's name is always in field 1), or by its relative position to something else (e.g. a single digit occurring before vs. after the email address, or before/after the 3rd field, or...).
Hope it makes sense. Add some printfs and play with it a bit, and ask questions if you're still confused after that.

Related

Lilypond: Variable page break penalty

In my use of Lilypond, I often face the same kind of problem: say I have four scores (3-4 lines each) that fit on two pages but not necessarily on one.
I refuse to have page breaks within scores. If possible, I want all the scores on the same page. When that's not possible, however, I would like the page break to occur between the first and second scores. If that is not possible either, between the second and the third, and only if it's really necessary between the third and the fourth. That is, by order of preference, with | representing the page break:
1 2 3 4 |
1 | 2 3 4
1 2 | 3 4
1 2 3 | 4
Is there a way to achieve that without trying things out and adding the page breaks myself? Maybe by having page-break penalties that increase after each score (but remain smaller than the penalty for adding a new page)?
Thank you in advance for your help.
You should use ly:page-turn-breaking (see "Optimal page turning" in the documentation). You'll probably also have to play with \pageTurn, \noPageTurn and \allowPageTurn in order to have the best control.
Here's a minimal example:
\version "2.19.82"

\header {
  title = "Page turn breaking"
}

\paper {
  % The default page breaking makes Score 1 end at the beginning of page 2.
  % The following option prevents this and keeps Score 1 entirely on the first page.
  page-breaking = #ly:page-turn-breaking
}

\score {
  \header { piece = "Score 1" }
  \new Staff {
    \clef "treble_8"
    \repeat unfold 14 { c'1*4 }
  }
  \layout {
    indent = 0
    system-count = 14
  }
}

\score {
  \header { piece = "Score 2" }
  \new Staff {
    \clef "treble_8"
    \repeat unfold 13 { e'2 f }
  }
  \layout {
    indent = 0
    system-count = 13
  }
}

Writing a circuit in ZoKrates to proof age is over 21 years

I am trying to see if I can use ZoKrates in a scenario where a user can prove to a verifier that their age is over 21 years without revealing the date of birth. I think it's a good use case for a zero-knowledge proof, but I'd like to understand the best way to implement it.
The circuit code (sample) takes the name of the user as a public input (name attestation is done by a trusted authority like the DMV, most likely through a combination of offline/online mechanisms), and then the date of birth, which is a private input.
// 8297122105 is "Razi" in decimal
def main(pubName, private yearOfBirth, private centuryOfBirth):
  x = 0
  y = 0
  z = 0
  x = if centuryOfBirth == 19 then 1 else 0 fi
  y = if yearOfBirth < 98 then 1 else 0 fi
  z = if pubName == 8297122105 then 1 else 0 fi
  total = x + y + z
  result = if total == 3 then 1 else 0 fi
  return result
Now, using the ./target/release/zokrates generate-proof command, I get the output below, which can be used as an input to verifier.sol.
A = Pairing.G1Point(0x24cdd31f8e07e854e859aa92c6e7f761bab31b4a871054a82dc01c143bc424d, 0x1eaed5314007d283486826e9e6b369b0f1218d7930cced0dd0e735d3702877ac);
A_p = Pairing.G1Point(0x1d5c046b83c204766f7d7343c76aa882309e6663b0563e43b622d0509ac8e96e, 0x180834d1ec2cd88613384076e953cfd88448920eb9a965ba9ca2a5ec90713dbc);
B = Pairing.G2Point([0x1b51d6b5c411ec0306580277720a9c02aafc9197edbceea5de1079283f6b09dc, 0x294757db1d0614aae0e857df2af60a252aa7b2c6f50b1d0a651c28c4da4a618e], [0x218241f97a8ff1f6f90698ad0a4d11d68956a19410e7d64d4ff8362aa6506bd4, 0x2ddd84d44c16d893800ab5cc05a8d636b84cf9d59499023c6002316851ea5bae]);
B_p = Pairing.G1Point(0x7647a9bf2b6b2fe40f6f0c0670cdb82dc0f42ab6b94fd8a89cf71f6220ce34a, 0x15c5e69bafe69b4a4b50be9adb2d72d23d1aa747d81f4f7835479f79e25dc31c);
C = Pairing.G1Point(0x2dc212a0e81658a83137a1c73ac56d94cb003d05fd63ae8fc4c63c4a369f411c, 0x26dca803604ccc9e24a1af3f9525575e4cc7fbbc3af1697acfc82b534f695a58);
C_p = Pairing.G1Point(0x7eb9c5a93b528559c9b98b1a91724462d07ca5fadbef4a48a36b56affa6489e, 0x1c4e24d15c3e2152284a2042e06cbbff91d3abc71ad82a38b8f3324e7e31f00);
H = Pairing.G1Point(0x1dbeb10800f01c2ad849b3eeb4ee3a69113bc8988130827f1f5c7cf5316960c5, 0xc935d173d13a253478b0a5d7b5e232abc787a4a66a72439cd80c2041c7d18e8);
K = Pairing.G1Point(0x28a0c6fff79ce221fccd5b9a5be9af7d82398efa779692297de974513d2b6ed1, 0x15b807eedf551b366a5a63aad5ab6f2ec47b2e26c4210fe67687f26dbcc7434d);
Question
Consider a scenario where a user (say Razi) can take the proof above (probably in the form of a QR code) and scan it on a machine (which confirms the age is over 21) that will run the verifierTx method on the contract. Since the proof explicitly has "Razi" inside it, and the contract can verify the age without knowing the actual date of birth, we get better privacy. However, the challenge is that anyone else can now reuse the proof, since it was exposed within the transaction. One way to mitigate this issue is to make sure the proof is either valid for a limited time or only good for one-time use. Another way is to ensure proof of the user's identity ("Razi") in a way that is satisfied beyond doubt (e.g. by confirming identity on the blockchain, etc.).
Are there ways to make sure the proof can be used by a user only once?
I hope the question and explanation make sense. Happy to elaborate more on this, so let me know.
What you will need is:
Razi owning an Ethereum public/private key pair
a (salted) fingerprint of the fact (e.g. the birthday as a Unix timestamp) associated with Razi's public Ethereum address and endorsed on-chain by an authority
Now you can write a ZoKrates program like this:
def main(private field salt, private field birthdayAsUnixTs, field pubFactHashA, field pubFactHashB, field ts) -> (field):
  // check that the fact corresponds to the salted fact fingerprint endorsed on-chain
  h0, h1 = sha256packed(0, 0, salt, birthdayAsUnixTs)
  h0 == pubFactHashA
  h1 == pubFactHashB
  // "18 years" is pseudocode only!
  field ok = if birthdayAsUnixTs + 18 years <= ts then 1 else 0 fi
  return ok
Now in your contract you can
check that msg.sender is the owner of the endorsed fact
require(ts <= now)
call verifier with the proof and public input: (factHash, ts, 1)
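Stripped of the ZoKrates/Solidity machinery, the age predicate the circuit enforces is a simple timestamp comparison; a plain-Python sketch (the seconds-per-year convention is an assumption the real circuit would have to fix explicitly):

```python
# Illustrative only: the predicate behind the circuit's "birthday + 18 years <= ts".
# 18 years are approximated as 365-day years in seconds; a real deployment
# would pin down this convention on both the prover and verifier side.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def is_old_enough(birthday_unix_ts: int, ts: int, min_years: int = 18) -> int:
    """Return 1 if the birthday lies at least `min_years` before `ts`, else 0."""
    return 1 if birthday_unix_ts + min_years * SECONDS_PER_YEAR <= ts else 0

# Born 2000-01-01 (946684800), checked at 2020-01-01 (1577836800):
print(is_old_enough(946684800, 1577836800))
```

The zero-knowledge part only hides the inputs; the arithmetic itself is this one comparison.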
You can do that by hashing the proof and adding that hash to a list of "used proofs", so no one can use it again.
Now, ZoKrates adds randomness to the generation of the proof in order to avoid revealing that the same witness has been used, since zk-proofs do not show anything about the witness. So, if you want to prevent a person from using his credential (accrediting that he is over 21 years old) more than once, you have to use a nullifier (see ZCash's approach in the "How zk-SNARKs are applied to create a shielded transaction" part).
Basically you build a string from Razi's data, nullifier_string = centuryOfBirth + yearOfBirth + pubName, and then you publish its hash, nullifier = H(nullifier_string), in a table of revealed nullifiers. In the ZoKrates scheme you have to add the nullifier as a public input and then verify that the nullifier corresponds to the data provided. Something like this:
import "utils/pack/unpack128.code" as unpack
import "hashes/sha256/256bitPadded.code" as hash
import "utils/pack/nonStrictUnpack256.code" as unpack256

def main(pubName, private yearOfBirth, private centuryOfBirth, [2]field nullifier):
  field x = if centuryOfBirth == 19 then 1 else 0 fi
  field y = if yearOfBirth < 98 then 1 else 0 fi
  field z = if pubName == 8297122105 then 1 else 0 fi
  total = x + y + z
  result = if total == 3 then 1 else 0 fi
  null0 = unpack(nullifier[0])
  null1 = unpack(nullifier[1])
  nullbits = [...null0, ...null1]
  nullString = centuryOfBirth + yearOfBirth + pubName
  unpackNullString = unpack256(nullString)
  nullbits == hash(unpackNullString)
  return result
This has to be done in order to prevent Razi from providing a random nullifier unrelated to his data.
Once you have done this, you can check whether the provided nullifier has already been used, i.e. whether it is registered in the revealed-nullifier table.
The problem with this in your case is that the year of birth is a weak number to hash. Someone could brute-force the nullifier and reveal Razi's year of birth. You have to add a strong number to the verification (Razi's secret ID? a digital signature?) to prevent this attack.
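A minimal sketch of why an unsalted nullifier is weak (plain Python with SHA-256; the string encoding is a hypothetical stand-in for the ZoKrates packing, not the real circuit hash):

```python
import hashlib

def nullifier(century: int, year: int, pub_name: int, salt: bytes = b"") -> str:
    """Hash the concatenated fields, optionally mixed with a secret salt."""
    data = f"{century}{year}{pub_name}".encode() + salt
    return hashlib.sha256(data).hexdigest()

# Without a salt, the private year of birth has so few possible values
# that an attacker can recover it from a published nullifier by enumeration:
published = nullifier(19, 97, 8297122105)
recovered = [
    (c, y)
    for c in (19, 20)
    for y in range(100)
    if nullifier(c, y, 8297122105) == published
]
print(recovered)  # the (century, year) pair leaks

# Mixing in a high-entropy secret (salt) makes this enumeration infeasible.
```

This is exactly the brute-force attack described above: the search space is a couple of hundred candidates, so the "private" input is private in name only unless a strong secret is added to the hash.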
Note 1: I have an old version of ZoKrates, so check that the import paths are right.
Note 2: Check the ZoKrates hash function implementation; you may have problems with the padding of the inputs. The unpack256 function should prevent this, I suppose, but double-check to avoid bugs.

Unpivot Csv Files with changing schemas on linux

From one of our customers, we receive a number of CSV files on our SFTP server. The files usually vary in terms of header names, column count and of course row count (usually somewhere between a couple of thousand and a couple of million rows; file size does not exceed 350 MB for most of them). Currently we process all the files through SSIS using a custom C# script.
What I want to accomplish is this: move the entire process to Linux (our SFTP server), in order to shorten the data flow and the pre-processing time.
This may very well be a trivial task for a lot of you guys, but I can't say I belong to that category, having no real experience developing on Linux.
So how do I do this? Are there any feasible solutions with regard to time efficiency, memory consumption, etc.?
CSV files could look like this, except that the number of user columns always changes:
eg. Filename: userdata.csv
Question; user1; user2; user3; user4
How old are you; 20; 22; 45; 54
How tall are you; 186; 176; 166; 195
And the output I'm after looks like this:
Question; Value; User; Filename
How old are you; 20; user1; userdata
How old are you; 22; user2; userdata
How old are you; 45; user3; userdata
How old are you; 54; user4; userdata
How tall are you; 186; user1; userdata
How tall are you; 176; user2; userdata
How tall are you; 166; user3; userdata
How tall are you; 195; user4; userdata
Suggestions, advice...anything is most welcome.
Update:
Just to elaborate on the input/output specifics..
input.csv (The result of a questionnaire)
2 questions, "How old are you" and "How tall are you" answered by 4 users, "user1", "user2", "user3" and "user4".
For the purpose of this example "user1" - "user4" is used.
In our live data the users real names are used.
The number of user columns will vary depending on how many participated in the questionnaire.
output.csv
The header row is changed to display 4 static fields: Question, Value, User and Filename.
Instead of having a row per question as in the input file, we need a row per user.
The Filename column should hold the name of the input file without extension.
Character encoding is UTF-8 and the separator is a semicolon. Qualifiers are not used.
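For comparison, the same unpivot can be sketched in plain Python with only the standard library (the sample text and filename are taken from the question; this is an illustration of the transform, not the production script):

```python
import csv
import io
import os

def unpivot(text: str, filename: str) -> str:
    """Turn one question-per-row input into one row per (question, user) pair."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = [h.strip() for h in next(reader)]   # "Question", "user1", "user2", ...
    base = os.path.splitext(filename)[0]         # filename without extension
    out = ["Question;Value;User;Filename"]
    for row in reader:
        question, *values = [f.strip() for f in row]
        for user, value in zip(header[1:], values):
            out.append(f"{question};{value};{user};{base}")
    return "\n".join(out)

sample = "Question; user1; user2\nHow old are you; 20; 22\n"
print(unpivot(sample, "userdata.csv"))
```

For multi-million-row files, the awk solution below streams line by line and is the better fit; this sketch just pins down the expected row shape.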
So, after a bit of reading here and a lot of trial and error, it seems I have a working solution. Though it might not be pretty and leaves room for improvement, this is what I've got:
A scheduled bash script loops over a filename array and passes the individual filenames to the awk script.
orgFile.sh
#!/bin/bash
shopt -s nullglob
fileList=(*.csv)
for i in "${fileList[@]}"; do
    awk -v filename="$i" -f newFile.awk "$i"
done
newFile.awk
#!/usr/bin/awk -f
function fname(file, a, n)
{
    n = split(file, a, ".")
    return a[1]
}
BEGIN {
    FS = ";"
    fn = "done_" filename
    print "Question;Value;User;Filename" > fn
}
NR == 1 {
    for (i = 1; i <= NF; i++)
        headers[i] = $i
    next
}
{
    for (i = 2; i <= NF; i++)
        print $1 FS $i FS headers[i] FS fname(filename) >> fn
}

How to optimize the search for the next free "slot" in MySQL?

I have a problem and I can't find an easy solution.
I have a self-expanding structure organized this way:
database1 | table1
          | table2
          ...
          | tableN
.
.
.
databaseN | table1
          | table2
          ...
          | tableN
Each table has a structure like this:
id|value
Each time a number is generated, it is put into the right database/table slot (it is divided this way for scalability; it would be impossible to manage tables of billions of records in a fast way).
The problem is that N is not fixed: it is more like a base for calculating numbers (to be precise, N is known, 62, but I can only use a subset of the "digits", and that subset can change over time).
For example, I can work only with 0, 1 and 2, and after a while (when I have used up all the possibilities) I add another digit, and so on (up to base 62).
I would like to find a simple way to find the first free slot to put the next randomly generated id into, in a way that can also be reversed.
Example:
I have 0 1 2 3 as the digits I want to use.
The element 2313 is put in database 2, table 3, and there will be 13|value in that table.
The element 1301 is put in database 1, table 3, and there will be 01|value in that table.
I would like to generate another number based on the next free slot.
I could test every slot starting from 0 up to the biggest number, but when there are millions of records in every database and table this will be impossible.
The next element after the first example would be 2320 (and not 2314, since I'm only using the digits 0 1 2 3).
I would like some sort of inverse lookup in MySQL to give me slot 23 in table 3 of database 2, so I can transform it back into the number. I could randomly generate a number and try to find the nearest free slot up or down, but since the digit set is variable that may not be a good choice.
I hope this is clear enough; any suggestion is welcome ;-)
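The increment-within-a-restricted-digit-set logic from the question can be sketched in Python (the digit set "0123" is the question's example; it would grow over time):

```python
def next_id(current: str, digits: str = "0123") -> str:
    """Return the next id using only the allowed digits (odometer-style)."""
    base = len(digits)
    # interpret the id as a number in base len(digits), add 1, convert back
    n = 0
    for ch in current:
        n = n * base + digits.index(ch)
    n += 1
    out = ""
    for _ in range(len(current)):
        out = digits[n % base] + out
        n //= base
    return out

print(next_id("2313"))  # -> "2320"
print(next_id("1301"))  # -> "1302"
```

Note this is exactly the "remember the last ID" shortcut mentioned at the end of the answer below: if slots are never freed, mapping the last used id through this function gives the next free one without scanning any tables.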
Use
show databases like 'database%' and a loop to find non-existent databases
show tables like 'table%' and a loop for tables
select count(*) from tableN to see if a table is "full" or not.
To find a free slot, walk the database with count in chunks.
This untested PHP/MySQL implementation will first fill up all existing databases and tables to base N+1 before creating new tables or databases.
The if(!$base) part should be altered if another behaviour is wanted.
findFreeChunk could also be solved with iteration, but I leave that effort to you.
define(DB_PREFIX, 'database');
define(TABLE_PREFIX, 'table');
define(ID_LENGTH, 2);

function findFreeChunk($base, $db, $table, $prefix='')
{
    $maxRecordCount = $base ** (ID_LENGTH - strlen($prefix));
    for ($i = -1; ++$i < $base;)
    {
        list($n) = mysql_fetch_row(mysql_query(
            "select count(*) from `$db`.`$table` where `id` like '"
            . ($tmp = $prefix . base_convert($i, 10, 62))
            . "%'"));
        if ($n < $maxRecordCount)
        {
            // incomplete chunk found: recursion
            for ($k = -1; ++$k < $base;)
                if ($ret = findFreeChunk($base, $db, $table, $tmp))
                    { return $ret; }
        }
    }
}

function findFreeSlot($base = NULL)
{
    // find current base if not given
    if (!$base)
    {
        for ($base = 1; !$ret = findFreeSlot(++$base););
        return $ret;
    }
    $maxRecordCount = $base ** ID_LENGTH;
    // walk existing DBs
    $res = mysql_query("show databases like '" . DB_PREFIX . "%'");
    $dbs = array();
    while (list($db) = mysql_fetch_row($res))
    {
        // walk existing tables
        $res2 = mysql_query("show tables in `$db` like '" . TABLE_PREFIX . "%'");
        $tables = array();
        while (list($table) = mysql_fetch_row($res2))
        {
            list($n) = mysql_fetch_row(mysql_query("select count(*) from `$db`.`$table`"));
            if ($n < $maxRecordCount) { return findFreeChunk($base, $db, $table); }
            $tables[] = $table;
        }
        // no table with an empty slot found: all available table names used?
        if (count($tables) < $base)
        {
            for ($i = -1; in_array($tmp = TABLE_PREFIX . base_convert(++$i, 10, 62), $tables););
            if ($i < $base) return [$db, $tmp, 0];
        }
        $dbs[] = $db;
    }
    // no database with an empty slot found: all available database names used?
    if (count($dbs) < $base)
    {
        for ($i = -1; in_array($tmp = DB_PREFIX . base_convert(++$i, 10, 62), $dbs););
        if ($i < $base) return [$tmp, TABLE_PREFIX . '0', 0];
    }
    // none: return false
    return false;
}
If you are not reusing your slots and never delete anything, you can of course dump all this and simply remember the last ID in order to calculate the next one.

Combine Cell Contents into another Cell

So I'm working on a spreadsheet in Google Docs. My question is: I would like to combine columns A, B, C, etc. in column I with a space between them. I'm currently using something like this: =A1&" "&B1&" "&C1&" "&etc. This works fine and dandy, but if a cell is blank I would like to ignore it. Should this be done via script or formula?
So in my head I'm thinking: if cell A has a value, then grab it and combine it with B (if that contains a value; if not, leave it blank or skip it). But I'm not good at PHP, so any help would be great! Happy New Year to everyone ;)
Here's a custom function that will return a string with the given range values joined by the given separator. Any blanks are skipped.
function joinVals( rangeValues, separator ) {
function notBlank(element) {return element != '';}
return rangeValues[0].filter(notBlank).join(separator);
}
Examples:
    A   B   C   Formula                  Result
1   1   2   3   =joinVals(A1:C1," x ")   1 x 2 x 3
2   1   2       =joinVals(A2:C2," x ")   1 x 2
3   1       3   =joinVals(A3:C3," x ")   1 x 3
4   1   2   3   =joinVals(A4:C4)         1,2,3
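The blank-skipping join itself is just a filter before the join; in plain Python terms (illustrative, not Apps Script):

```python
def join_vals(row, separator=","):
    """Join the non-blank cells of a row with the separator, skipping blanks."""
    return separator.join(str(v) for v in row if v != "")

print(join_vals([1, "", 3], " x "))  # -> "1 x 3"
print(join_vals([1, 2, 3]))          # -> "1,2,3"
```

The Apps Script version above does exactly this with filter() and join() on the first row of the passed range.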
By using IF and ISBLANK you can determine whether a cell should be included, or ignored. Like this:
=if(ISBLANK(A1),"",A1 & " ")
That reads "if the cell is blank, ignore it, otherwise echo it with a trailing space." You can daisy-chain a series of those together:
=if(ISBLANK(A1),"",A1 & " ")&if(ISBLANK(B1),"",B1 & " ")&if(ISBLANK(C1),"",C1 & " ")...
That gets pretty long and repetitive. Adding ARRAYFORMULA and JOIN, we can have that repetitive piece apply across a range of cells, A1:F1 in this case:
=JOIN("",ARRAYFORMULA(IF(ISBLANK(A1:F1),"",A1:F1&" ")))