How to analyze/calculate a lot of CSV logfiles automatically

I have a bunch of logfiles from machines at my work, and I want to analyze them. The files are named like #increasingnumber_date_time.csv, and all together there are roughly 10,000 of them (0.5-2 MB each), containing information about temperatures, pressures, and the status of actuators like vents, pumps, etc.
I want a script that performs some calculations (integrating the pressures whenever special conditions are met) on each of these CSV files and stores the result of every calculation in a CSV/Excel file, so that I end up with a list of the logfile names and the corresponding result of each calculation.
It is not necessary to merge all these files into one super-big megafile; it is totally fine if they are opened and processed one after another, as long as every calculation result is written to the result file in a way that lets me match each result to its source CSV file.
How can I do this? I am no programming expert, but I use Python/pandas sometimes.
Thank you
I tried to do it file by file using Excel. :-)
I expect this is a batch job that can be automated, but I do not know how. :)
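Since you already use Python/pandas, a short script is all this needs. Here is a minimal sketch, assuming each logfile has columns named time, pressure, and valve_open; those column names and the condition are placeholders to be replaced with whatever your files actually contain:

import glob
import pandas as pd

results = []
for path in sorted(glob.glob("logs/*.csv")):    # process the files one after another
    df = pd.read_csv(path)
    # keep only the rows where the "special condition" holds (placeholder condition)
    seg = df.loc[df["valve_open"] == 1, ["time", "pressure"]]
    # trapezoidal rule: integrate pressure over time across the selected rows
    integral = ((seg["pressure"] + seg["pressure"].shift()) / 2 * seg["time"].diff()).sum()
    results.append({"logfile": path, "integral": integral})

# one row per logfile, so every result can be traced back to its source file
pd.DataFrame(results).to_csv("results.csv", index=False)

Because only one file is loaded at a time, ~10,000 files of this size stay comfortably within memory.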

Related

Understanding openaddresses data format

I have downloaded the us-west geolocation data (postal addresses) from openaddresses.io. Some of the addresses in the datasets are incomplete, i.e., some of them don't have info like the zip code. Is there a way to retrieve it, or is the data just incomplete?
I have tried searching the other files hoping to find related info, but the complete dataset doesn't contain anything related to it. The city of Mesa, AZ has multiple zip codes, so it is hard to assign one to an address. Is there any way to address this problem?
This is what the data looks like (City of Mesa, AZ):
LON,LAT,NUMBER,STREET,UNIT,CITY,DISTRICT,REGION,POSTCODE,ID,HASH
-111.8747353,33.456605,790,N DOBSON RD,,SRPMIC,,,,,dc0c53196298eb8d
-111.8886227,33.4295194,2630,W RIO SALADO PKWY,,MESA,,,,,c38b700309e1e9ce
-111.8867018,33.4290795,2401,E RIO SALADO PKWY,,TEMPE,,,,,9b912eb2b1300a27
-111.8832045,33.4232903,700,S EVERGREEN RD,,TEMPE,,,,,3435b99ab3f4f828
-111.8761202,33.4296416,2100,W RIO SALADO PKWY,,MESA,,,,,b74349c833f7ee18
-111.8775844,33.4347782,1102,N RIVERVIEW,,MESA,,,,,17d0cf1542c66083
Short answer: the data is incomplete.
The data in OpenAddresses.io is only as complete as the datasource it pulls from. OpenAddresses is just an aggregation of publicly available datasets. There's no real consistency between government agencies that make their data available. As a result, other sections of the OpenAddresses dataset might have city names or zip codes, but there's often something missing.
If you're looking to fill in the missing data, take a look at how projects like Pelias use multiple data sources to augment missing data.
Personally, I always end up going back to OpenStreetMap (OSM). One could argue that OpenAddresses is better quality because it comes from official sources and doesn't try to fill in data using approximations, but the large gaps of missing data make it far less useful, at least on its own.
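If you just want to quantify the gaps in your own extract first, here is a minimal pandas sketch (the filename mesa.csv is a placeholder for wherever you saved the data shown above):

import pandas as pd

df = pd.read_csv("mesa.csv", dtype=str)   # keep every column as text
missing = df["POSTCODE"].isna()           # empty CSV fields load as NaN
print(missing.sum(), "of", len(df), "rows have no POSTCODE")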

Generating truth tables for basic logic circuits

Let's say I have a text file that looks like this:
<number> <name> <type> <inputs...>
1 XOR1 XOR A B
2 SUM XOR 1 C
What would be the best approach to generate the truth table for this circuit?
That depends on what you have available, and how big your file is.
Perl is optimized for reading files and generating simple text output. It doesn't have a library of boolean operators, but they're easy enough to write. I'd use that if I just wanted text-in, text-out.
If I wanted to display the data online AND generate a results file, I'd use PHP to read the data and write the table to a CSV file that could either be opened in Excel, or posted online in an HTML table.
If your data is in a REALLY BIG data file, I'd use SQL.
If your data is in a really huge file that you want to be accessible to authorized users online, and you want THEM to be able to create truth tables, I'd use Oracle's APEX to create an easy interface for them to build their own truth tables and play around with the data without altering it.
If you're in an electrical engineering environment, use the tools designed for your problem -- Verilog or similar.
Whatcha got? Whatcha wanna do with it?
-- Ada
I prefer using C#. I already have the code to 'parse' the input text file. I just don't know where to start in terms of actually 'simulating' it. The output can simply be a text file with inputs and output values. – Don
How many inputs and how many outputs are there in the circuit you want to simulate?
The size of the simulation determines how it can most easily be run. If the circuit is small(ish), you can enter the inputs and circuit values into vector arrays, then cross them to get the output matrix.
Matlab is ideal for this, as it was written for processing arrays.
Again: Whatcha got, and whatcha wanna do with it?
-- Ada
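To make the enumerate-and-evaluate idea concrete, here is a minimal sketch in Python (the structure ports directly to C# once the file is parsed). The gate table is illustrative, the two-gate circuit is hard-coded from the question's example, and gates are assumed to be listed in dependency order:

from itertools import product

GATES = {"XOR": lambda a, b: a ^ b,
         "AND": lambda a, b: a & b,
         "OR":  lambda a, b: a | b}

# (output_id, gate_type, input_ids), as parsed from the question's file
circuit = [("1", "XOR", ["A", "B"]),
           ("2", "XOR", ["1", "C"])]
primary_inputs = ["A", "B", "C"]

print(" ".join(primary_inputs), "OUT")
for bits in product([0, 1], repeat=len(primary_inputs)):  # every input combination
    values = dict(zip(primary_inputs, bits))
    for out_id, gate, ins in circuit:                     # evaluate gates in file order
        values[out_id] = GATES[gate](*(values[i] for i in ins))
    print(" ".join(str(values[i]) for i in primary_inputs), values[circuit[-1][0]])

Eight input combinations in, eight truth-table rows out; for n primary inputs the table simply grows to 2^n rows.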

Counting the number of passes through a CSV file in JMeter

Am I missing an easy way to do this?
I have a CSV file with a number of params in it, and in my test I want to be able to make some of the fields unique across CSV repetitions with a suffix determined by the number of times I've looped through the file.
So suppose my CSV (simplified) had:
abc
def
ghi
I want to generate in the test
abc_1
def_1
ghi_1 <hit EOF>
abc_2
def_2
ghi_2 <hit EOF>
abc_3
def_3
ghi_3
I thought I could set up a counter to run parallel to my CSV loop, but that won't work unless I increment it by 1/n each iteration, where n is the number of lines in my CSV file, which you can't do because counters are integers.
I'm going to go flail around and see if I can come up with a solution, but in case I'm not successful, has anyone got any suggestions?
I've used an EOF marker row (an index column with something like "EOF" or "END", etc.) together with an IF controller and either a non-resetting counter OR user variables incremented via JavaScript in a BSF element (a BSF assertion or whatever, just a mechanism to run the script).
Unfortunately, it's the best solution I've come up with without putting too much effort into it.
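For reference, the increment itself is only a couple of lines in the BSF/JSR223 element (JavaScript). The names passCount (a user-defined variable initialised to "1") and field1 (the first column in the CSV Data Set Config) are illustrative:

// runs once per iteration; bump the pass counter when the marker row comes around
if (vars.get("field1") == "EOF") {
    vars.put("passCount", String(parseInt(vars.get("passCount"), 10) + 1));
}

Elsewhere in the test plan the unique value can then be built as ${field1}_${passCount}.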

When could CSV records *not* have the same number of fields?

I am storing a series of events in a CSV file; each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in its own field...
speed, incline, snores
therefore...
15mph, 20%,
, , 12
16mph, 20%,
14mph, 20%,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
CSV data is to be of use to other developers (I will/could/should/won't use either structure).
The 'other developers' tend to be toward the novice end of the spectrum and/or to be using resource-limited systems. CSV is accessible.
The CSV format is being provided non-exclusively, as a feature rather than a requirement. Although, if the application is going to provide a CSV file, it should provide it in the correct manner from now on.
Question
Would it be valid, in this case, to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage if the CSV is read by a human rather than a processor. Neither method is more complex to parse using a custom parser, but will Option B void the usefulness of the CSV format with other libraries, frameworks, applications, et al.? With Option A, future changes/versions to an individual event's data set may break the CSV structure (zombie ", ," fields to maintain forwards compatibility), whereas Option B will fail gracefully.
edit
This may be aimed at students and frameworks like openFrameworks, Plask, Processing, et al., where CSV is easier to implement.
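For what it's worth, consuming Option B takes only a few lines with a per-event schema lookup; here is a minimal Python sketch, where the schema table and filename are illustrative:

import csv

SCHEMAS = {"running": ["speed", "incline"], "sleeping": ["snores"]}

events = []
with open("events.csv", newline="") as f:
    for row in csv.reader(f):
        name, values = row[0].strip(), [v.strip() for v in row[1:]]
        events.append({"event": name, **dict(zip(SCHEMAS[name], values))})

# events is now e.g. [{'event': 'running', 'speed': '15mph', 'incline': '20%'}, ...]

The row "sleeping, 12" yields {'event': 'sleeping', 'snores': '12'}, so a record with fewer fields degrades into a smaller dict rather than breaking the parse.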
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readbility is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
As you say, there is no "CSV standard" with regard to contents. The real answer depends on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is "don't over-engineer", i.e., don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say Option B is fine, unless you have specific requirements to use other apps, etc.
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. a bug in one format but not another).
Having multiple formats means more code that needs to be tested, documented, and supported.
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.

How do I use awk to parse a fixed-width (NACHA) file format?

My company has a problem: we suspect that the NACHA files we are receiving from one of our application service providers that we use to draw money from our clients are incorrect.
We have all of the ACH agreements and legal mumbo-jumbo in place, so it's not a problem with our use of the ACH network, and we're not receiving word from the banks that things are going wrong, so we suspect that when the file is built from the sales information, it's missing some transactions that we are still getting charged for by our service provider.
My task: Take several months' worth of NACHA files and decipher them to find out what was drawn from each customer and what was deposited to our accounts, and then compare that to sales data, bank statements, and other information via Access/Excel, using MySQL for the data.
At this point, awk (or similar Linux command line tool) is the tool that I have; I'm not proficient with 'actual' programming tools or practice, I'm more of a system-and-database administrator. I'm not afraid to get my hands dirty, I just don't have a lot of programming experience in reading this sort of thing with, say, C#.
My chief difficulty is in working with the actual NACHA file format: it's 94 characters wide, with fields determined by their position only, no delimiters. Using awk is (in my previous experience) dependent on the field separator variable, which is either whitespace or anything else... but I have been unsuccessful using it to tease out fields by position. I need to use something like awk because of the different record types in each file; there are five different line types: 1, 5, 6, 8, and 9. Types 1 and 9 are the outer group, with file-level header/trailer info, and 5 and 8 are the batch header and trailer lines. Type 6 lines are the details.
My original plan was to read the header info into variables and then duplicate it on each line, basically de-normalizing it into a large table (or CSV, in the interim) with one record for each individual transaction, associated with all the header info from the batch and the day, so:
[transaction data1, data2],[batch data1, data2],[file info1, info2, etc]
[transaction data1, data2],[batch data1, data2],[file info1, info2, etc]
[transaction data1, data2],[batch data1, data2],[file info1, info2, etc]
I favor building a tool that can keep doing this going forward, because it will become part of the data monitoring we do on a daily/weekly basis.
So, how can I denormalize a NACHA file using awk or some similar tool? If there's a better tool for the job, I'm more than happy to hear about it. I haven't found anything in my searching online, unfortunately.
If you look at the gawk info file (info gawk), there is a section called "3.6 Reading Fixed-Width Data". That may provide the information you need if you're using gawk.
From that file:
The splitting of an input record into fixed-width fields is specified by assigning a string containing space-separated numbers to the built-in variable FIELDWIDTHS.
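Since FIELDWIDTHS applies one layout to every record, and a NACHA file mixes record types with different layouts, it is often simpler to slice each line with substr(), keyed on the record type in column 1. Here is a minimal gawk sketch of the denormalization described above; the positions follow the usual NACHA batch header and entry detail layouts but are offered as illustration only, so verify them against the spec before trusting the output:

# denormalize.awk -- run as: gawk -f denormalize.awk input.ach > flat.csv
BEGIN { OFS = "," }
{ type = substr($0, 1, 1) }             # the record type is always the first character
type == "5" {                           # batch header: remember its fields
    company   = substr($0, 5, 16)
    entrydesc = substr($0, 54, 10)
}
type == "6" {                           # entry detail: one CSV row per transaction
    txcode  = substr($0, 2, 2)
    account = substr($0, 13, 17)
    amount  = substr($0, 30, 10) / 100  # amounts are stored in cents, no decimal point
    name    = substr($0, 55, 22)
    print txcode, account, amount, name, company, entrydesc
}

Fixed-width fields keep their padding, so add gsub(/ +$/, "", field) calls if you need trimmed values.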