How to rearrange this csv email name list - csv

I have a .csv list of emails + names.
Each name can have 1-3 emails along with it (which are currently separated by a comma).
I need to convert this into a .csv list where each row has one name and one email.
Here is an example:
John Smith,johnsmith1@gmail.com,johnsmith2@gmail.com,johnsmith3@gmail.com
Taylor Smith,taylorsmith@gmail.com
Jack Smith,jacksmith1@gmail.com,jacksmith@gmail.com
...(and there are like 10k more rows)
How can I automatically convert this to:
John Smith,johnsmith1@gmail.com
John Smith,johnsmith2@gmail.com
John Smith,johnsmith3@gmail.com
Taylor Smith,taylorsmith@gmail.com
Jack Smith,jacksmith1@gmail.com
Jack Smith,jacksmith@gmail.com
The main problem is attaching the name to each of the separate email rows that were initially on that name's line.
I appreciate any help. It seems like an easy task, but I have been stuck on it for a few days already. Thanks.

If you can use Notepad++, here is a two-step solution.
In the Replace dialog, select "Regular expression" as the search mode and uncheck ". matches newline".
Step 1: rearrange each line by moving the name to the end of the line.
Search for ^([^,@\n]+),(.+)\R? and replace with $2,$1\n
Step 1 > demo at regex101 (explanation on the right side)
Step 2: now the name can be captured inside a lookahead and used in the replacement.
Search for ([^\n,]++)(?=.*,([^\n,@]+$)),(?:(?1)(?:\n|$))? and replace with $2,$1\n
Step 2 > demo at regex101
Each comma-separated substring is captured by the first group, while the second group captures the name inside a lookahead. The replacement puts the name before the email, and the optional trailing part of the pattern removes the name from the end of the line.
After both steps the result should look like your desired outcome. Further info: SO Regex FAQ
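If a script is an option instead of an editor, the same transformation can be sketched in Python with the standard csv module. This is only a minimal sketch, not part of the Notepad++ approach above: the function name explode_emails and the in-memory sample are made up for illustration, and it assumes the name is always the first field of each row.

```python
import csv
import io

def explode_emails(rows):
    """Yield one (name, email) pair for every email found in each input row."""
    for row in rows:
        if not row:
            continue  # skip blank lines
        name, *emails = row  # assumption: the first field is always the name
        for email in emails:
            if email.strip():  # ignore empty trailing fields
                yield name, email.strip()

# Sample rows modelled on the question; a real run would read the .csv file instead.
sample = (
    "John Smith,johnsmith1@gmail.com,johnsmith2@gmail.com,johnsmith3@gmail.com\n"
    "Taylor Smith,taylorsmith@gmail.com\n"
    "Jack Smith,jacksmith1@gmail.com,jacksmith@gmail.com\n"
)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerows(explode_emails(csv.reader(io.StringIO(sample))))
print(out.getvalue(), end="")
```

For the real 10k-row file, the two io.StringIO objects would be replaced by files opened with open(path, newline="").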


How can I parse words when there is only Enter Mark between them in MySql?

I have some interesting data: country names stored side by side in one column, and I need to get each one of them for a specific id.
I just don't know how to parse those country names.
When I try to locate that mark,
select locate('¶',facility_country) from table_name,
it only returns 0. I need to find a way to parse the country names out of that string.
For each id, I want to expand my data into one row per country, or maybe build a new dimension table of id and country. I'm not yet sure how to write the parsing code or function, but I can manage that; I just need to locate that mark so I can split on it.
I tried using the ASCII code, like this:
select CHAR(182);
This returns me the same mark.
select locate(CHAR(182),facility_country) from table_name;
Even that way I can't locate the mark; it still returns 0.
How can I parse those country names using that Enter mark? I have done similar things with "," or " ", but this is the first time I have seen something like this.
Edit: When I copy full text it looks like this after I paste:
USA
Australia
Brazil
Canada
France
Germany
Ireland
Israel
Italy
Netherlands
New Zealand
Switzerland
United Kingdom
(Stack Overflow puts them side by side like that; in DBeaver this is what I see:)
edit2: @RiggsFolly requested this:
SELECT facility_country from clinical_trials LIMIT 5
output:
There are many lines like line 3.
edit3: @Tangentially Perpendicular solved it. We are looking at a rendered image, so we can't tell what the raw data is. Apparently it's CHAR(10), and I can locate it with that.
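To illustrate why the search for the mark fails, here is a small Python sketch. The dictionary of ids and values is made up for illustration; it only assumes, as the last edit says, that the stored separator is a line feed (CHAR(10)) and that the ¶ glyph is merely how the client renders it.

```python
# Hypothetical raw column values keyed by id; "\n" is the line feed, CHAR(10) in MySQL.
facility_country_by_id = {
    1: "USA\nAustralia\nBrazil",
    2: "Canada",
}

pilcrow_code = ord("¶")    # 182: the glyph the client draws
newline_code = ord("\n")   # 10: the character actually stored

# Searching for the pilcrow finds nothing, just like LOCATE(CHAR(182), ...) returning 0.
pilcrow_found = any("¶" in text for text in facility_country_by_id.values())

# Splitting on the line feed yields one (id, country) pair per country,
# which is the shape of the id/country dimension table described above.
id_country_rows = [
    (row_id, country)
    for row_id, text in facility_country_by_id.items()
    for country in text.split("\n")
    if country
]

print(pilcrow_code, newline_code, pilcrow_found)
print(id_country_rows)
```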

Trying to pull the Name and/or ID of the code below, but can only pull the Job-Base-Cost

Below is the code I have now. It pulls the Job-Base-Cost just fine, but I cannot get it to pull the ID and/or Name of the item. Can you help?
Link to the site's XML pull.
=importxml("link","//job-base-cost")
This is a sample of one line of the OP's XML file
<job-base-cost id="24693" name="Abaddon Blueprint">109555912.69</job-base-cost>
The OP wants to use the IMPORTXML function to report the ID and Name as well as the Job Cost from the XML data. Presently, the OP's formula is:
=importxml("link","//job-base-cost")
There are two options:
1 - One long column
=importxml("link","//@id | //@name | //job-base-cost")
Note //@id and //@name in the XPath query: // selects nodes anywhere in the document (at any level, not just the root level) and @ selects attributes. The pipe | is the XPath union operator, so it combines the results of the three queries. In plain English, the query displays the id, name and job-base-cost.
2 - Three columns (table format)
={IMPORTXML("link","//@name"),IMPORTXML("link","//job-base-cost"),IMPORTXML("link","//@id")}
This creates an array that displays the fields in three columns.
Note: there is an arrayformula that uses a single importXML function described in How do I return multiple columns of data using ImportXML in Google Spreadsheets?. Readers may want to look at whether that option can be implemented.
My thanks to @Tanaike for his comment which spurred me to look at how xpath works.
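For readers who want to check what the attribute and node queries select outside of Google Sheets, here is a minimal Python sketch using the standard library's ElementTree. The wrapping jobs element is an assumption made so that the single sample line forms a complete document; the real feed's root element is not shown in the question.

```python
import xml.etree.ElementTree as ET

# The sample line from the question, wrapped in a hypothetical root element.
xml_text = """<jobs>
  <job-base-cost id="24693" name="Abaddon Blueprint">109555912.69</job-base-cost>
</jobs>"""

root = ET.fromstring(xml_text)

# One (name, cost, id) tuple per job-base-cost node, mirroring the three-column layout.
rows = [
    (node.get("name"), float(node.text), node.get("id"))
    for node in root.iter("job-base-cost")
]

print(rows)
```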

Reshape the dataset into more relational format (Transpose SOME rows and assign them to a data subset)

I have a spreadsheet/csv:
Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
...
I need to reshape it to the following view:
Code:,Course Description:,Students,Lecture Hours,Labs Hours,Average Test Score,Teaching Staff
101,"Introduction to Rocket Science",John Galt,48,120,4.7,Passed
101,"Introduction to Rocket Science",James Taggart,50,120,4.9,Passed
...
Believe it or not, I cannot work out how to do this, even though it seems to be a very primitive transformation. Is there a silver bullet for this?
The original records (csv) have a somewhat JSON-like structure, so my first approach was to represent the original data as a vector and then transpose it (but in this case the resulting table looks like a sparse matrix: the rows I have transposed are blank in the rest of their values).
Another way I'm considering is to **serialize it into JSON and then de-serialize** it into a new spreadsheet (jsonize()); in this case, I'm having problems merging the pieces properly.
Both ways I have it "half-working".
Can anyone suggest a simple and reliable algorithm for this?
Any language, RegEx, tools or code snippets are much appreciated.
Assuming that the pattern you've described here is consistent throughout, there are quite a few different approaches you could take, but in all cases you can use the fact that the course rows start with "Code:", which is never going to be a student name.
You can take advantage of this either by a regular expression find/replace, or within OpenRefine.
Example:
Open the file in a text editor that supports regular expressions in find/replace
Search for lines starting with 'Code:' and add additional commas to the start of the row to shift the course data columns to the right, e.g. search for: ^Code: replace with: ,,,,,Code:
If you now import the file into OpenRefine, you'll have a project with 10 columns (the 10th column is caused by the trailing comma at the end of the course data row)
You can now use Transpose (or just rename) on the right-most columns, which contain the course data, while leaving the left-most columns, which contain the student details
Isolate the rows that contain the phrase 'Student Name' in the first column and remove them (via a filter or facet)
Move the Course Code/Description columns to the beginning of the project, and use the 'Edit Cells->Fill Down' option on each column to get the values repeated on all the relevant lines
Finally, rename the columns as you want and remove any extraneous columns
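The same idea (course rows start with "Code:", header rows start with "Student Name", everything else is a student) can also be sketched as a short Python script. This is an alternative to the OpenRefine steps rather than a description of them; the function name flatten_courses is made up, and it assumes every course row has the layout shown in the question.

```python
import csv
import io

def flatten_courses(rows):
    """Prefix every student row with the code and description of its course."""
    code = description = None
    for row in rows:
        if not row:
            continue  # skip blank lines
        if row[0] == "Code:":
            # Assumed layout: Code:,<code>,Course Description:,<description>,
            code, description = row[1], row[3]
        elif row[0] == "Student Name":
            continue  # per-course header row, not data
        else:
            yield [code, description, *row]

sample = '''Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
'''

flat = list(flatten_courses(csv.reader(io.StringIO(sample))))
for row in flat:
    print(row)
```

Because the current course is remembered between rows, this plays the same role as the Fill Down step: each student row inherits the most recent course values.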

NiFi : Regular Expression in ExtractText gets CSV header instead of data

I'm working on a flow where I get CSV files. I want to put the records into different directories based on the first field in the CSV record.
For ex, the CSV file would look like this
country,firstname,lastname,ssn,mob_num
US,xxxx,xxxxx,xxxxx,xxxx
UK,xxxx,xxxxx,xxxxx,xxxx
US,xxxx,xxxxx,xxxxx,xxxx
JP,xxxx,xxxxx,xxxxx,xxxx
JP,xxxx,xxxxx,xxxxx,xxxx
I want to get the value of the first field, i.e. country, and put the records into a directory based on it: US records go to the US directory, UK records go to the UK directory, and so on.
The flow that I have right now is:
GetFile ----> SplitText(line split count = 1 & header line count = 1) ----> ExtractText (line = (.+)) ----> PutFile(Directory = \tmp\data\${line:getDelimitedField(1)}). I need the header line to be replicated across all the split files for a different purpose, so I need to keep it.
The incoming CSV file is successfully split into multiple flow files, each with the header. However, the regex I have given in the ExtractText processor is evaluated against the CSV header of each split flow file instead of the record. So instead of getting US or UK in the "line" attribute, I always get "country", and all the files go to \tmp\data\country. How can I resolve this?
I believe getDelimitedField only works on a single line, so it is likely not moving past the newline in your split file.
I would advocate for a slightly different approach in which you alter your ExtractText to find the country code through a regular expression, which avoids the need to include the contents of the file as an attribute.
Using a regex of ^.*\n+(\w+) will match the header line and the newline after it, then capture the first run of word characters on the next line (everything up to the comma) in capture group 1. The value is placed in an attribute based on the property name you specify (e.g. country.1).
I have created a template that should get the value you are looking for available at https://github.com/apiri/nifi-review-collateral/blob/master/stackoverflow/42022249/Extract_Country_From_Splits.xml
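To see what the two patterns match, here is a small Python sketch run against the content of one split flow file. NiFi evaluates Java regular expressions, so this is only an approximation, although these two simple patterns should behave the same way in both engines; the variable names are made up for illustration.

```python
import re

# Content of one split flow file: the replicated header plus a single record.
flow_file = "country,firstname,lastname,ssn,mob_num\nUS,xxxx,xxxxx,xxxxx,xxxx\n"

# The original pattern (.+) stops at the first newline, so it captures the header.
header_match = re.search(r"(.+)", flow_file).group(1)

# The suggested pattern skips the header line and captures the first word after it.
country = re.match(r"^.*\n+(\w+)", flow_file).group(1)

print(header_match)
print(country)
```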

What can I do with an inconsistent column delimited text file?

I have a text file that looks something like...
firstname:middle:lastname
firstname:middle:lastname
firstname:lastname
firstname:middle:lastname
firstname:lastname
I would like to be able to eventually use this information in a MySQL database, but since the columns are not correct I am not sure what to do. Is there any way to resolve this?
If the data only ever has the variations shown above, you can make these assumptions:
The first part is the firstname
The last part is the lastname
So in PHP, for example, you could use explode to split each row on the delimiter, which in this case is :.
When looping through the rows, treat the first part as the firstname, the last part as the lastname, and anything in between as the middlename.
You can use count() to find out how many parts the current row has inside the loop, which tells you which one is the last part.
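The answer above is phrased in terms of PHP's explode and count(); the same logic can be sketched in Python as follows. The function name parse_name is made up for illustration, and the sketch assumes every row has at least a first and a last name.

```python
def parse_name(line, delimiter=":"):
    """Split a row into (first, middle, last); middle is None when absent."""
    parts = line.strip().split(delimiter)
    if len(parts) < 2:
        # Assumption: a valid row always has at least first and last name.
        raise ValueError(f"expected at least two fields, got: {line!r}")
    first, last = parts[0], parts[-1]
    # Anything between the first and last part is treated as the middle name.
    middle = delimiter.join(parts[1:-1]) or None
    return first, middle, last

rows = [
    "firstname:middle:lastname",
    "firstname:lastname",
]
parsed = [parse_name(row) for row in rows]
print(parsed)
```

Storing the missing middle name as None maps naturally to a NULL in a nullable MySQL column.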
If the file is so simple ... the solution is trivial
firstname:middle:lastname
firstname:lastname
if(there are only two columns) { that means we have first and last name }
else { we have first, middle and last name }
If there are more columns, you may still be able to map the data to the proper columns if you can build a priority list (the order in which columns can be missing, for example 'last name > first name > middle name'), and/or combine that with data type matching (string/int/double/date). Either way, you need to gather all your domain knowledge and see whether it suffices.