Parsing a CSV of file paths line by line [Logic Request] - csv

I have a tricky data set to parse, and I haven't been able to formulate a processing method for it; several failed ideas have just left me more confused...
Sample data in CSV format - before processing
C:\User1\Videos\videos
10
C:\User1\Videos\videos
22
C:\User2\Videos\videos
8
C:\User2\Videos\videos
67
C:\User3\Videos\videos
18
C:\User3\Videos\videos
12
C:\User4\Videos\videos
90
I'm trying to combine the lengths of the video files in each user's video directory and output a list of each user and the total runtime of all their files.
Result - after processing
C:\User1\Videos\videos
32
C:\User2\Videos\videos
75
C:\User3\Videos\videos
30
C:\User4\Videos\videos
90
I'm looking for pseudocode, or really any advice, on how I can achieve this result. I have been unsuccessful with nested loops and am having a hard time conceptualizing other solutions. I am writing this in VBScript for convenience with file processing on Windows.
Thanks so much in advance; I appreciate any advice you can offer.

First, this is a line-delimited format with two lines per record:
1: directory
2: video length
Second, you need only a single loop to read each line of the file and process the data.
Steps
1. Dim a dictionary variable: Set dic = CreateObject("Scripting.Dictionary").
2. Dim variables for the file path, user key, and length value.
3. Open the file and loop, reading lines until the end of the file.
4. Inside the loop, read the first line of the record and identify the user. VBScript can Split the string on "\". If the idea is to aggregate all lengths under User1 no matter what the remaining subfolders are, split the path and use the first element as the user key; you can also check that the second element is Videos to filter, use more elements as the key, or, as in your results example, use the full string as the key for exact matching. Store the user key in a local variable.
5. Inside the loop, read the second line of the record, parse the length, and store it in a local variable.
6. Check whether the key exists in the dictionary. If it does, add the stored value to the new length and write the sum back: dic.Item(userkey) = dic.Item(userkey) + length. If it does not, just add the new pair: dic.Item(userkey) = length.
7. Repeat from step 4 until the end of the file.
8. List the results by iterating the dictionary's keys and printing each key and its value.
The value stored in the dictionary could be an Object to store more information. Don't forget error handling.
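For reference, here is a minimal sketch of this approach, written in Python for brevity (the dictionary logic maps one-to-one onto VBScript's Scripting.Dictionary); the input file name lengths.txt is a placeholder:

totals = {}
with open("lengths.txt") as f:            # placeholder input file name
    for path_line in f:
        key = path_line.strip()           # record line 1: the directory path
        length = int(next(f).strip())     # record line 2: the video length
        totals[key] = totals.get(key, 0) + length

for key in totals:                        # print each directory and its total
    print(key)
    print(totals[key])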

Related

Iterate through multiple json strings to return all possible data points - VB.NET

I have a series of files that contain JSON strings. I can open and loop through each file, but I'm struggling to loop through each layer of the JSON to determine ALL possible data points within the files. The data is dynamic, so some files will contain certain fields/data and some won't. Without manually searching, I'm struggling to capture all fields that may exist.
Key = meta
Key = meta data_version
Key = meta created
Key = meta revision
I tried a series of nested loops to iterate through level 1, then level 2 within level 1, etc., and that gets me the format above, where we have meta at level 1 and the following fields at level 2 below it.
The code falls over when the next level is an array.
Dim O2 = (o1(entry1.Key)).ToString
Dim o2j As JObject = JObject.Parse(O2)
Line 2 falls over with the following error:
Error reading JObject from JsonReader. Current JsonReader item is not an object: StartArray. Path '', line 1, position 1.
I've searched pretty much everywhere, but most examples seem to use a static data format. So can anyone point me to an example that would suit my needs, or kindly explain how I can account for the possibility of an array in the structure?
thanks in advance.
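The usual fix for the error above is to branch on the node's type before descending, rather than assuming every level is an object. Here is a sketch of that recursive pattern, in Python for brevity (in the VB.NET/JObject code this corresponds to checking for JObject vs. JArray before parsing); the sample JSON is made up:

import json

def walk(node, path=""):
    if isinstance(node, dict):        # object: recurse into each field
        for key, value in node.items():
            walk(value, (path + " " + key).strip())
    elif isinstance(node, list):      # array: recurse into each element, same path
        for item in node:
            walk(item, path)
    else:                             # scalar leaf: print the accumulated key path
        print("Key = " + path)

walk(json.loads('{"meta": {"data_version": 2, "authors": [{"name": "x"}]}}'))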

Way to extract columns from a CSV and place them into a dictionary

So basically I'm at a wall with an assignment and it's beginning to really frustrate me. Essentially I have a CSV file, and my goal is to count the number of times each string appears. Column 1 has a string and column 2 has an integer connected to it. I ultimately need this formatted into a dictionary. Where I am stuck is how the heck to do this without using imported libraries. I am only allowed to iterate through the file using for loops. Would my best bet be indexing each line, turning it into a string, and counting how many times that string appears? Any insight would be appreciated.
If you don't want to use any library (and assuming you are using Python), you can use a dict comprehension, like this:
with open("data.csv") as file:
    # split each line on the comma so we index columns, not characters
    csv_as_dict = {line.split(",")[0]: line.split(",")[1].strip() for line in file}
Note: The question is possibly a duplicate of Creating a dictionary from a csv file?.
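If the assignment really wants the integers in column 2 totalled per string using only for loops, a plain loop with dict.get may fit better than a comprehension. A sketch, assuming a headerless two-column data.csv:

totals = {}
with open("data.csv") as file:
    for line in file:
        name, count = line.strip().split(",")   # column 1: string, column 2: integer
        totals[name] = totals.get(name, 0) + int(count)
print(totals)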

Random selection from CSV file in Jmeter

I have a very large CSV file (8000+ items) of URLs that I'm reading with a CSV Data Set Config element. It is populating the path of an HTTP Request sampler and iterating through with a while controller.
This is fine, except that I want each user (thread) to pick a random URL from the CSV URL list. What I don't want is each thread using the CSV items sequentially.
I was able to achieve this with a Random Order Controller and multiple HTTP Request samplers; however, 8000+ HTTP Samplers bogged JMeter down to an unusable state, which is why I put the HTTP Sampler URLs in the CSV file. It doesn't appear that I can use the Random Order Controller with the CSV file data, though. So how can I achieve random CSV data item selection per thread?
There is another way to achieve this:
create a separate thread group
depending on what you want to achieve:
add a (random) loop count -> this will set a start offset for the thread group that does the work
add a loop count (or forever) and a timer, and let it loop while the other thread group is running; this thread group will read a 'pseudo' random line
It's not truly random: the file is still read sequentially, but your worker thread makes jumps in the file. It worked for me ;-)
There's no random selection function when reading CSV data. The reason is that you would need to read the whole file into memory first, and that's a bad idea with a load test tool (any load test tool).
Other commercial tools solve this problem by automatically re-processing the data. In JMeter you can achieve the same manually by simply sorting the data on an arbitrary field. If you sort by, say, Surname, the result is an effectively random distribution.
Note: if you keep the default All Threads sharing mode on the CSV Data Set Config, the data will be unique within the scope of the JMeter process.
The new Random CSV Data Set Config from the BlazeMeter plugin should fit your needs perfectly.
As other answers have stated, the reason you can't select a line at random is that you would have to read the whole file into memory, which is inefficient.
Rather than trying to get JMeter to handle this on the fly, why not just randomise the file order itself before you start the test?
A scripting language such as perl makes short work of this:
cat unrandom.csv | perl -MList::Util=shuffle -e 'print shuffle<STDIN>' > random.csv
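If Perl isn't handy, the same pre-test shuffle is a few lines of Python (file names taken from the example above):

import random

with open("unrandom.csv") as f:       # read every line into memory
    lines = f.readlines()
random.shuffle(lines)                 # shuffle the line order in place
with open("random.csv", "w") as f:    # write the shuffled order back out
    f.writelines(lines)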
For my case:
single column
small dataset
Non-changing CSV
I just discarded the CSV, and following https://stackoverflow.com/a/22042337/6463291 I used a Beanshell PreProcessor instead, something like this:
// hard-coded list of the values that used to live in the CSV
String[] query = new String[]{"csv_element1", "csv_element2", "csv_element3"};
// pick a random index and expose the value to the sampler as ${randomOption}
Random random = new Random();
int i = random.nextInt(query.length);
vars.put("randomOption", query[i]);
Performance seems OK; if you have the same issue, you can try this out.
I am not sure if this will work, but I will suggest it anyway.
Why not divide your URLs among, say, 100 different CSV files? Then in each thread, generate a random number and use it to pick which CSV file to read with the __CSVRead function.
CSVRead">http://jmeter.apache.org/usermanual/functions.html#_CSVRead
The only part I am not sure about is whether the __CSVRead function reopens the file every time or shares the same file handle across threads.
You may want to try it. Please share your findings.
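For what it's worth, the nested-function form of this idea might look something like ${__CSVRead(urls_${__Random(1,100)}.csv,0)}; the urls_N.csv naming is hypothetical, and while __CSVRead and __Random are both standard JMeter functions, the file-handle sharing behaviour mentioned above is worth verifying first.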
A much more straightforward solution:
1. In the CSV file, add another column (say B).
2. Apply the =RAND() function in the first cell of column B (B1). This creates a random float.
3. Drag the corner of cell B1 to fill the formula down for all the corresponding URLs.
4. Sort by column B; your URLs will now be in random order.
5. Delete column B.

DT_TEXT concatenating rows on Flat File Import

I have a project that imports a TSV file with a field set as text stream (DT_TEXT).
When I have invalid rows that get redirected, the DT_TEXT fields from my invalid rows get appended to the next valid row.
Here's my test data:
Tab-delimited input file: ("tsv IN")
CatID Descrip
y "desc1"
z "desc2"
3 "desc3"
CatID is set as an integer (DT_I8)
Descrip is set as a text stream (DT_TEXT)
Here's my basic Data Flow Task:
(I apologize, I can't post images until my rep is above 10 :-/ )
So my two invalid rows get redirected, and my third row goes to success.
But here is my "Success" output:
"CatID","Descrip"
"3","desc1desc2desc3"
Is this a bug when using DT_TEXT fields? I am fairly new to SSIS, so maybe I misunderstand the use of text streams. I chose to use DT_TEXT as I was having truncation issues with DT_STR.
If it's helpful, my tsv Fail output is below:
Flat File Source Error Output Column,ErrorCode,ErrorColumn
x "desc1"
,-1071607676,10
y "desc2"
,-1071607676,10
Thanks in advance.
You should really try to avoid using the DT_TEXT, DT_NTEXT, or DT_IMAGE data types within SSIS fields, as they can severely impact data flow performance. The problem is that these types come through not as a CLOB (Character Large OBject) but as a BLOB (Binary Large OBject).
For reference see:
CLOB: http://en.wikipedia.org/wiki/Character_large_object
BLOB: http://en.wikipedia.org/wiki/BLOB
Difference: Help me understand the difference between CLOBs and BLOBs in Oracle
With DT_TEXT you cannot just pull the characters out as you would from a large array. The type is represented as an array of bytes and can store any kind of data, which in your case is not needed and is what is concatenating your fields. (I recreated the problem in my environment.)
My suggestion would be to stick with DT_STR for your description, giving it a large OutputColumnWidth. Make it large enough that no truncation occurs when reading from your source file, and test it out.

Find Dict Values from csv.DictReader

I'm trying to take a csv file and turn it into a dictionary, via csv.DictReader. After doing this, I want to modify one of the columns of the dictionary, and then write the data into a tsv file. I'm dealing with words and word frequencies in a text.
I've tried using the dict.values() method to obtain the dictionary values, but I get an error message saying "AttributeError: DictReader instance has no attribute 'values'".
Below is my code:
#calculate frequencies of each word in Jane Austen's "Pride and Prejudice"
import csv
#open file with words and counts for the book, and turn into dictionary
fob = open("P&P.csv", "r")
words = csv.DictReader(fob)
dict = words
#open a file to write the words and frequencies to
fob = open("AustenWords.tsv", "w")
#set total word count
wordcount = 120697
for row in words:
    values = dict.values()
    print values
Basically, I have the total count of each word in the text (i.e. "a","1937") and I want to find the percentage of the total word count that the word in question uses (thus, for "a", the percentage would be 1937/120697.) Right now my code doesn't have the equation for doing this, but I'm hoping, once I obtain the values of each row, to write a row to the new file with the word and the calculated percentage. If anyone has a better way (or any way!) to do this, I would greatly appreciate any input.
Thanks
To answer the basic question ("why am I getting this error?"): when you call csv.DictReader(), the return type is an iterator, not a dictionary.
Each ROW in the iterator is a dictionary, which you can then use in your script:
for row in words:
    values = row.values()
    print values
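Building on that, here is a minimal sketch of the full flow the question describes; the column names word and count are guesses and need to match the actual header row of P&P.csv:

import csv

wordcount = 120697                      # total word count, from the question

with open("P&P.csv") as src, open("AustenWords.tsv", "w") as dst:
    for row in csv.DictReader(src):     # each row is a plain dictionary
        pct = int(row["count"]) / float(wordcount)
        dst.write("%s\t%s\n" % (row["word"], pct))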
Thank goodness for Matt Dunnam's answer (I'd reply to it, but I don't see how to). csv.DictReader objects are, quite counter-intuitively, NOT dictionary objects (although I'm beginning to see some usefulness in why not). As he says, a csv.DictReader object is an iterator (to my intro-level Python understanding, something like a list). Each entry in that object is a dictionary.
So csv.DictReader returns something like a list of dictionaries, which is not the same as returning one dictionary object, despite the name.
What is nice, so far, is that csv.DictReader preserved my key values from the first row and placed them correctly in each of the many dictionary objects that make up the iterable it actually returned (again, it does not return a dictionary object!).
I wasted about an hour banging my head on this; the documentation was not clear enough, although now that I understand what type of object csv.DictReader returns, it reads much more clearly. I think the documentation says something like "returns an iterable object", but if you think it returns a dictionary and you don't know whether dictionaries are iterable, it's easy to read that as "returns a dictionary object".
The documentation should say something like "this does not return a dictionary object, but instead returns an iterable object containing a dictionary for each entry". As a Python newbie who hasn't coded in 20 years, I keep running into documentation that is written by and for experts and is too dense for beginners.
I'm glad it's there and that people have given their time to it, but it could be made easier for beginners without reducing its worth to expert pythonistas.