PySpark Extract Substring from Free Form Text

I am trying to extract data from a JSON file using PySpark, and the data I need is stored in a free-form text field. Each record contains data similar to the sample below.
I need the values under VAL, STAGE, ID, DATE and TIME. The section of text that I need starts with Audit Information and ends just before the word NOTES. Each line ends with a pipe character, and the section I need is usually buried in a large amount of other text.
Here's what the data looks like unformatted:
---------- Audit Information -------|
TEXT1: TEXT2: TEXT3: |
TEXT4: TEXT5: TEXT6: |
INDICATOR: |
VAL STAGE ID DATE TIME |
310 000 F11 220925 0110440 |
315 001 F14 200926 0110440 |
347 001 220926 0112310 |
NOTES: |
|
---------- Next Section ------------|
And here's how it would appear formatted:
My initial thought was to get the position of Audit Information and use that as the starting point, then get the position of NOTES to close the section off. But I'm not sure how to proceed from there.
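One possible way to proceed, sketched in plain Python so it can be tried without a cluster: a regular expression captures everything between Audit Information and NOTES, the text is split on the pipe characters, and only the lines after the VAL/STAGE/ID/DATE/TIME header are kept. The function name `parse_audit_rows` is hypothetical, and the handling of short rows is an assumption based on the sample (a four-value row is treated as missing its ID). The same function could be wrapped in a PySpark UDF, which is not shown here.

```python
import re

# Column names taken from the header line in the sample.
COLUMNS = ["VAL", "STAGE", "ID", "DATE", "TIME"]

SAMPLE = (
    "some earlier text |\n"
    "---------- Audit Information -------|\n"
    "TEXT1: TEXT2: TEXT3: |\n"
    "INDICATOR: |\n"
    "VAL STAGE ID DATE TIME |\n"
    "310 000 F11 220925 0110440 |\n"
    "315 001 F14 200926 0110440 |\n"
    "347 001 220926 0112310 |\n"
    "NOTES: |\n"
    "---------- Next Section ------------|\n"
)


def parse_audit_rows(text):
    """Return one dict per data row between the column header and NOTES."""
    # Capture everything from the "Audit Information" banner up to "NOTES".
    section = re.search(r"Audit Information(.*?)NOTES", text, flags=re.DOTALL)
    if section is None:
        return []
    rows = []
    in_table = False
    # Every line ends with a pipe, so split on it and trim whitespace.
    for line in section.group(1).split("|"):
        tokens = line.split()
        if tokens == COLUMNS:
            in_table = True  # data rows start after the header line
            continue
        if not in_table or not tokens:
            continue
        if len(tokens) == 4:
            # Assumption: a row with four values is missing its ID,
            # as in the third data row of the sample.
            tokens.insert(2, None)
        if len(tokens) == 5:
            rows.append(dict(zip(COLUMNS, tokens)))
    return rows


rows = parse_audit_rows(SAMPLE)
for row in rows:
    print(row)
```

If the real field is fixed-width rather than whitespace-collapsed, slicing each line by column position would be a more reliable way to detect an empty ID.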

Related

Complex Mail Merge (CSV to Word, CSV to PDF, or Other)

QUESTION:
How do you write an IF statement for Word or for PDF to calculate multiple rows per matching result?
USAGE:
What I am trying to do seems fairly straightforward and was very easy when I was able to use MS Access 15 years ago, but with Access no longer an option, I am hoping somebody has a reasonable solution.
The WHAT:
I am trying to generate Statements/Invoices from a CSV (or spreadsheet of any format) into a nice report layout. Let's say the columns look like this:
First Name | Last Name | Account | Address | Item | Description | Item Total
Jane | Smith | 123 | 111 Main St | Ice Cream | it's really cold | $100.00
This is super easy: I can do it in Word within 10 minutes and make it "pretty".
BUT what if there are multiple Items per invoice?
So maybe the CSV looks like:
First Name | Last Name | Account | Address | Item | Description | Item Total
Jane | Smith | 123 | 111 Main St | Ice Cream | it's really cold | $100.00
Jane | Smith | 123 | 111 Main St | Hot Dogs | all beef, all the time | $200.00
I still want only one invoice per person, but I'm not sure how to write an IF statement in Word that would say "If there are multiple items per person, put each on a new row, then total them all together".
I would be glad to have the CSV go into a PDF fillable form if I could get the multiple rows to work - I just cannot figure that portion out.
Other options: I looked at OpenOffice "Base" but couldn't get a nice form for a very custom report. I briefly researched how to do something like this on AWS, but without any luck. I don't think Microsoft has anything like Access anymore.

You can use Word's Catalogue/Directory Mailmerge facility for this (the terminology depends on the Word version). To see how to do so with any mailmerge data source supported by Word, check out my Microsoft Word Catalogue/Directory Mailmerge Tutorial at:
http://www.msofficeforums.com/mail-merge/38721-microsoft-word-catalogue-directory-mailmerge-tutorial.html
or:
http://www.gmayor.com/Zips/Catalogue%20Mailmerge.zip
The tutorial covers everything from list creation to the insertion & calculation of values in multi-record tables in letters. Do read the tutorial before trying to use the mailmerge document included with it.
Depending on what you're trying to achieve, the field coding for this can be complex. However, since the tutorial document includes working field codes for all of its examples, most of the hard work has already been done for you - you should be able to do little more than copy/paste the relevant field codes into your own mailmerge main document, substitute/insert your own field names and adjust the formatting to get the results you desire. For some worked examples, see the attachments to the posts at:
http://www.msofficeforums.com/mail-merge/9180-mail-merge-duplicate-names-but-different-dollar.html#post23345
http://www.msofficeforums.com/mail-merge/11436-access-word-creating-list-multiple-records.html#post30327
Another option would be to use a DATABASE field in a normal ‘letter’ mailmerge main document and a macro to drive the process. An outline of this approach can be found at: http://answers.microsoft.com/en-us/office/forum/office_2010-word/many-to-one-email-merge-using-tables/8bce1798-fbe8-41f9-a121-1996c14dca5d
Conversely, if you're using a relational database, or an Excel workbook with a separate table containing just a single instance of each of the grouping criteria, a DATABASE field in a normal ‘letter’ mailmerge main document could be used without the need for a macro. An outline of this approach can be found at:
https://answers.microsoft.com/en-us/msoffice/forum/msoffice_word-mso_winother-mso_2010/mail-merge-to-a-word-table-on-a-single-page/4edb4654-27e0-47d2-bd5f-8642e46fa103
For a working example, see:
http://www.msofficeforums.com/mail-merge/37844-mail-merge-using-one-excel-file-multiple.html
The problem with the DATABASE field, though, is that it won't provide the totals you're after. Nevertheless, if you're going down the macro route, it wouldn't take too much more code to append a totals row to the resulting table.
Alternatively, you may want to try one of the Many-to-One Mail Merge add-ins, from:
Graham Mayor at http://www.gmayor.com/ManyToOne.htm; or
Doug Robbins at https://onedrive.live.com/?cid=5AEDCB43615E886B&id=5AEDCB43615E886B!566
PS: While I'm cognisant of StackOverflow's preference for the substance of answers to be posted here rather than linked to, the complexity in this case is far too great to deal with that way, besides which, one can't post the actual field codes or a document containing them here.
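Independently of Word, the many-to-one logic the question asks for (one invoice per account, one row per item, then a total) can be illustrated with a short Python sketch. This is not the field-code approach from the tutorial; it only shows the grouping step, assuming pipe-delimited input with the column names from the question. The function name `group_invoices` is hypothetical.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

SAMPLE = (
    "First Name|Last Name|Account|Address|Item|Description|Item Total\n"
    "Jane|Smith|123|111 Main St|Ice Cream|it's really cold|$100.00\n"
    "Jane|Smith|123|111 Main St|Hot Dogs|all beef, all the time|$200.00\n"
)


def group_invoices(csv_text):
    """Group item rows by account and total them per invoice."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter="|")
    invoices = defaultdict(lambda: {"name": "", "items": [], "total": Decimal("0")})
    for row in reader:
        invoice = invoices[row["Account"]]
        # Strip the currency symbol and thousands separators before parsing.
        amount = Decimal(row["Item Total"].replace("$", "").replace(",", ""))
        invoice["name"] = f'{row["First Name"]} {row["Last Name"]}'
        invoice["items"].append((row["Item"], amount))
        invoice["total"] += amount
    return dict(invoices)


invoices = group_invoices(SAMPLE)
for account, invoice in invoices.items():
    print(account, invoice["name"], invoice["total"])
    for item, amount in invoice["items"]:
        print("   ", item, amount)
```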

How to loop transpose function for every row containing a specific character on Google Sheets?

This is what I have (each "|" symbol indicates a new column on the same row)
John | Doe
Manager
NY
123-45-67
Fax: 987-54-32
a#b
Jane
Assistant
CA
234-56-78
c#d
Mike | Brown
Analyst | Intern
CA
345-67-89
e#f
However, I am trying to get it to look like the below on Google Sheets:
John Doe | Manager | [empty] | NY | 123-45-67 | Fax: 987-54-32 | a#b
Jane | Assistant | [empty] | CA | 234-56-78 | [empty] | c#d
Mike Brown | Analyst | Intern | CA | 345-67-89 | [empty] | e#f
The names are all formatted in bold font so I can use that property as my identifier to be able to merge last names and first names into the same column. However, not sure how I can leave a column empty if a fax number exists in one record and it doesn't in another.
I ultimately want it to be able to create a new record row after each cell that has a "#" character in it. How much of this is possible? If it can be done, how much of it can be done and how can it be done in Google Sheets?

It took some work, but this can be done without a script, using the sheet's built-in functions. I will put a link to an example in a comment below, but I cannot promise to keep it there forever. Here is the method.
STEP 0: Add appropriate headers in row 1 for sanity. Reserve column A for a record locator, to be built later. Thus, in the example above "John" sits in B2 and "Doe" is in C2.
STEP 1: Build a column for each type of field, and indicate whether the value is that type. For instance, my column D determines if something is an email with the (draggable) formula =iferror(find("#",B2)>0,false). In column E I determine if something is a name. You could use the bolding idea, but I went for =or((row()=2),D1), which says either I am the first row of data, or the preceding row was an email. This too gets dragged. Similarly to test for states: =and((LEN(B2)=2),(B2=upper(B2))), fax =(upper(mid(B2,1,3))="FAX"), ID =and((len(B2)=9),mid(B2,4,1)="-",mid(B2,7,1)="-"), and finally anything else must be a job =not(or(D2,E2,F2,G2,H2)).
STEP 2: Construct the value from the 2 potential columns. If the second is empty, just the first. Otherwise names get a space between and anything else gets "and" between. =if(ISBLANK(C2),B2,if(E2,B2&" "&C2,B2&" and "&C2))
STEP 3: Construct the record type based on which things were true. I put the type names in the header row, so they will match later, which makes this formula a bit impenetrable, but I hope you get the idea: =if(D2,$S$1,if(E2,$N$1,if(F2,$P$1,if(G2,$R$1,if(H2,$Q$1,if(I2,$O$1,"error")))))). Now in A2 I append the record number to the record type, where the record number is how many names we have met so far, thus (also draggable) =K2&countif($E$2:$E2,true). [The idea of building a key this way comes from Prashanth
STEP 4: I left column L blank for neatness so the results are separate from the input and calculations, and then I put a record number in column M as follows =(row()-1). And now we apply vlookup across the rest of the row to get each field by matching it to the column header of the desired output (name|job|state|ID|fax|email) as follows =vlookup(N$1&$M2,$A:$K,10,false), which is built to be draggable both to the rightmost column (email) and the bottom row (row 4 for record 3, in our example). Missing data show up as #N/A (if you find that ugly, an IfError can make it say something nicer).
I hope this illustrates not only a result, but also a method, with which you could tinker if you saw fit.
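For readers who find the formulas hard to follow, the same classify-then-pivot logic can be sketched in Python. This is only an illustration of the method in steps 1 to 4, not something that runs inside Google Sheets; the names `classify` and `build_records` are hypothetical, and the tests mirror the formulas above (a "#" marks an e-mail, two uppercase letters a state, and so on).

```python
FIELDS = ["name", "job", "state", "id", "fax", "email"]

ROWS = [
    ["John", "Doe"], ["Manager"], ["NY"], ["123-45-67"], ["Fax: 987-54-32"], ["a#b"],
    ["Jane"], ["Assistant"], ["CA"], ["234-56-78"], ["c#d"],
    ["Mike", "Brown"], ["Analyst", "Intern"], ["CA"], ["345-67-89"], ["e#f"],
]


def classify(value, is_first_of_record):
    """Decide the field type of a cell, in the same priority order as step 3."""
    if "#" in value:
        return "email"
    if is_first_of_record:
        return "name"  # first row overall, or the row right after an e-mail
    if len(value) == 2 and value == value.upper():
        return "state"
    if value[:3].upper() == "FAX":
        return "fax"
    if len(value) == 9 and value[3] == "-" and value[6] == "-":
        return "id"
    return "job"  # anything else


def build_records(rows):
    """Turn the vertical list into one output row per record."""
    records = []
    current = {}
    for cells in rows:
        kind = classify(cells[0], is_first_of_record=not current)
        # Step 2: names are joined with a space, anything else with " and ".
        joiner = " " if kind == "name" else " and "
        current[kind] = joiner.join(cells)
        if kind == "email":
            # An e-mail closes the record; missing fields become empty strings.
            records.append([current.get(field, "") for field in FIELDS])
            current = {}
    return records


records = build_records(ROWS)
for record in records:
    print(" | ".join(record))
```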

Creating SQL Table layout for dynamic document

I apologize if this question is vague, but I'll try to be as clear as possible. I've been given a task where I'm to take a text file, store its content in SQL Server 2008, and automate the creation of a form letter given certain inputs. I've been able to break it into the following generic structure (pay no attention to the content, it's just generic text, but the situational break-down is similar):
Welcome [User],
[if @purchase = true, add this paragraph]
Thank you for purchasing the [device / subscription / subscription and device]
from this business on [date].
[if @purchase = true and @return = true, add this paragraph]
I'm sorry you returned it!
...
Signed,
[Author]
[Author Image]
Assuming I'm already able to bring in all the necessary variables (user, purchase, return, date, device or device and subscription or subscription only), how should I go about storing the letter pieces in SQL? Would it be considered fine to have a structure like the following:
+-------+-----------------+----------+--------+
| Order | Text            | purchase | return |
+-------+-----------------+----------+--------+
| 1     | (1st paragraph) | TRUE     | null   |
| 2     | (2nd paragraph) | TRUE     | FALSE  |
+-------+-----------------+----------+--------+
Where I store the contents of the first paragraph as:
Thank you for purchasing the [device / subscription / subscription and device]
from this business on [date].
And then write a stored procedure to piece it together based on the Boolean columns, and find/Replace the bracketed bits with input variables to output the entire letter as a string? It doesn't seem like it would be able to handle much variability, to be honest. Maybe breaking down the document into paragraph and sentence tables?
My ultimate goal would be to output this to either a report I create or, perhaps more ideally, to a Word document (though this is probably a whole different bit of research). Am I way off base here? Any insight is helpful.

You can use REPLACE in the SELECT statement, for example:
SELECT REPLACE(REPLACE(Text, 'device', @deviceVariable), 'subscription', @subscriptionVariable) FROM [Order]
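The overall idea from the question (pick paragraphs by flag, then replace the bracketed placeholders) can be sketched end to end as follows. This uses Python's built-in sqlite3 module rather than SQL Server 2008, purely so the example is runnable; the table name `letter_part`, its columns, and the function `build_letter` are all hypothetical. A NULL flag is assumed to mean "this flag does not matter for this paragraph".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE letter_part ("
    "position INTEGER, body TEXT, purchase INTEGER, returned INTEGER)"
)
conn.executemany(
    "INSERT INTO letter_part VALUES (?, ?, ?, ?)",
    [
        (1, "Thank you for purchasing the [item] from this business on [date].", 1, None),
        (2, "I'm sorry you returned it!", 1, 1),
    ],
)


def build_letter(connection, purchase, returned, values):
    """Select the paragraphs whose flags match, then fill in placeholders."""
    rows = connection.execute(
        "SELECT body FROM letter_part "
        "WHERE (purchase IS NULL OR purchase = ?) "
        "AND (returned IS NULL OR returned = ?) "
        "ORDER BY position",
        (int(purchase), int(returned)),
    ).fetchall()
    text = "\n".join(body for (body,) in rows)
    # Replace each [placeholder] with its value.
    for name, value in values.items():
        text = text.replace(f"[{name}]", value)
    return text


values = {"item": "device", "date": "10/15/13"}
kept_letter = build_letter(conn, purchase=True, returned=False, values=values)
returned_letter = build_letter(conn, purchase=True, returned=True, values=values)
print(kept_letter)
print(returned_letter)
```

Delimiting placeholders with brackets, as the question's template already does, avoids accidentally replacing the word "device" where it appears as ordinary text.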

Creating / Appending a Flat File Destination based on date.

The Backstory:
I have a process that loads physician demographic data into our system. This data can come in at any time and at any interval between updates. The data is what we call "Term-by-Exclusion", meaning that the source file takes precedence, and any physician record in the db that is not in the source file is marked as "Termed" or Inactive.
The Problem:
I need to be able to output the source data to a flat file destination as a daily report for a companion COBOL system. The source data is loaded into an ETL.PhysicianLoad table prior to processing, and the ETL table is wiped before each new processing transaction, so retaining a full day's records is not possible as it stands now without the output file.
Example: ProcessOutput_10152013.txt
The output file ideally needs to be a comprehensive record of the entire day's processing, meaning I want to continuously append to that day's file until the end of the day, then email a notification stating the file is ready for pickup. Any data that comes in after the turn of the day should then be placed in a newly created file.
Output should look like this (no headers)
BatchID | LastName | FirstName | MiddleInitial | Date
0001 | Smith | John | A | 10/15/13
0001 | Smith | Sue | R | 10/15/13
0001 | Zeller | Frank | L | 10/15/13
0002 | Peters | Paula | D | 10/15/13
0002 | Rivers | Patrick | E | 10/15/13
0002 | Waters | Oliver | G | 10/15/13
What I am thinking:
I am thinking about using a CurrentDate Variable that will hold the current date comparing it to an expression based variable called FileName which will concatenate the current mmddyyyy to "ProcessOutput_.txt". My thinking is that I should be able to locate a file with that name in the destination folder and if it exists, I should be able to write to it. Otherwise I will have to create a new file. I can then set my Flat File Destination via expression to the FileName Variable.
Can anyone see a better way of doing this or any issues that may arise from this solution I am not seeing?

My thought process was in the right place, but flawed.
Here is how I solved the problem.
After trying to build my control/data flows using the logic in the original question, I discovered that I was working myself into a corner.
So that got me thinking again: how can I do this in the easiest possible way?
First, do I have the correct Variables defined? No.
CurrentDate - has to be there to define the date portion of the file name.
FileName - has to be present for obvious reasons.
So what did I miss?
FileExists (Type: boolean) - Something that will identify the existence of the file.
PlaceholderFile (Type: String) - Generic FileName Variable
Now what to do with it?
Add a VB Script Task to the control flow, that sets the FileExists flag.
'Check to see if ProspectivePhysician_<currentdate>.txt exists.
Dts.Variables("User::FileExists").Value = System.IO.File.Exists(Dts.Variables("User::FileName").Value.ToString)
Now that we know whether the destination file exists, create the data flow from the source table and check the FileExists variable in a Conditional Split, separating the data flow into two branches. Create two Flat File Destinations called "Existing" and "New", setting them both to the same flat file location for the time being.
If you attempt to run the package at this point, you will receive Validation Errors from one of the two destinations, as the first is holding ownership of the file and will not allow the second to validate the file.
How to fix this: use expressions to swap the actual FileName value back and forth.
For the Existing Flat File Connection String Value, use the following Expression:
@[User::FileExists] == True ? @[User::FileName] : @[User::PlaceholderFile]
For the New Flat File Connection String value, use the following Expression:
@[User::FileExists] == True ? @[User::PlaceholderFile] : @[User::FileName]
Finally, right-click each of the Flat File Destination objects in the Data Flow and set the Overwrite property to True on the New Flat File Destination and False on the Existing destination. This ensures that the append action is used on the existing file.
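Outside SSIS, the underlying logic (build a date-stamped file name, check whether it exists, then append or create) is small enough to sketch in Python. This is only an illustration of the decision being made, not a replacement for the package; the function names are hypothetical, and the ProcessOutput_ prefix and mmddyyyy format are taken from the question.

```python
import tempfile
from datetime import date
from pathlib import Path


def daily_output_path(folder, day):
    """Build the path for a given day's file, e.g. ProcessOutput_10152013.txt."""
    return Path(folder) / f"ProcessOutput_{day:%m%d%Y}.txt"


def append_batch(folder, day, rows):
    """Append rows to the day's file, creating it if it does not exist yet."""
    path = daily_output_path(folder, day)
    existed = path.exists()  # plays the role of the FileExists variable
    # Mode "a" appends to an existing file and creates a missing one.
    with path.open("a", encoding="utf-8") as handle:
        for row in rows:
            handle.write("|".join(row) + "\n")
    return path, existed


with tempfile.TemporaryDirectory() as folder:
    day = date(2013, 10, 15)
    path, existed_first = append_batch(
        folder, day, [("0001", "Smith", "John", "A", "10/15/13")]
    )
    _, existed_second = append_batch(
        folder, day, [("0002", "Peters", "Paula", "D", "10/15/13")]
    )
    file_name = path.name
    line_count = len(path.read_text(encoding="utf-8").splitlines())

print(file_name, existed_first, existed_second, line_count)
```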

Reporting Services: make room for long value in a field

I'm struggling with Reporting Services to get this done.
I have several reports with a common header which contains some contact info, including e-mail address and web address. If those get too long, they currently overwrite the data in the cell below them, which is really ugly.
So I have a report header something like this:
+-------------------------------------------------------------+
|                          T I T L E                          |
|                                                             |
| Name                         E-Mail:  (value of e-mail)     |
| Address                      Web Url: (value of web url)    |
| Zip City                                                    |
|                                                             |
+-------------------------------------------------------------+
Those fixed texts ("E-Mail" and "Web Url") are standard Textfields, as are the value fields, which are bound to a set of data that my report gets from my ASP.NET application. These values (on the right of the header) are contained inside a single "Rectangle" that basically groups everything on the right together.
But if the e-mail is really long, what I'd like to do is "slide" the "Web Url:" label and value down one line, like this:
+-------------------------------------------------------------+
|                          T I T L E                          |
|                                                             |
| Name                         E-Mail:  (really really long   |
| Address                               value of e-mail)      |
| Zip City                     Web Url: (value of web url)    |
|                                                             |
+-------------------------------------------------------------+
But somehow, no matter what I try, I can't get this behavior :-( The Textfield for the e-mail value has its CanGrow property set to true and the contents of the e-mail does extend down into the next line - but there it just overwrites the string value in the "web url" value field....
Any ideas?
Thanks!

@benni_mac_b was right: if the output fields are properly aligned and separated, Reporting Services itself will handle this situation just fine.
In my case, I had ever so slightly overlapping output fields, and for some reason that seems to have prevented the usual growth of a line and caused the line below to be "overprinted".
I carefully separated those output fields, making sure each field's height was less than the vertical distance to the field below it, and now it works just fine.
