Split sentences by Case change where two words are "stuck" together - html

I am attempting to clean up the following data which has been extracted from HTML.
Some sentences haven't quite split correctly with the Capitalised word at the start of one sentence "stuck" to the preceding word.
The image below illustrates what I am trying to achieve:
In essence, given a run of text such as: The boy plays with the ballThe Girl plays with the Console, the result should be split into:
The boy plays with the ball
The Girl plays with the Console
Here is my M code so far, using the actual data (it must be run in Power BI because it uses the Html.Table function, which is not available in Excel).
let
Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://echa.europa.eu/registration-dossier/-/registered-dossier/14184/7/1"))}),
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Contains([Column1], "General Population - Hazard via oral route") then [Column1] else null),
#"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] <> null)),
#"Kept Last Rows" = Table.LastN(#"Filtered Rows", 1),
#"Removed Other Columns" = Table.SelectColumns(#"Kept Last Rows",{"Custom"}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("</dd><dt>", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter", "Text", each Html.Table([Custom], {{"Custom",":root"}})),
#"Expanded Text" = Table.ExpandTableColumn(#"Added Custom1", "Text", {"Custom"}, {"Custom.1"})
in
#"Expanded Text"

The image still looks incorrect (informationOverall is not split), but if you want to split by character transition, you can do so from the ribbon: Split Column > By Lowercase to Uppercase.
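For readers who want to check the idea outside Power Query, here is a minimal Python sketch of the same lowercase-to-uppercase split. The function name split_on_case_change is made up for this illustration, and the pattern only covers ASCII letters, so it is a rough stand-in rather than a reproduction of what the ribbon command does.

```python
import re

def split_on_case_change(text):
    """Split text wherever a lowercase letter is directly followed by an uppercase one."""
    # Zero-width pattern: lookbehind for a-z, lookahead for A-Z, so no characters are lost.
    return re.split(r"(?<=[a-z])(?=[A-Z])", text)

parts = split_on_case_change("The boy plays with the ballThe Girl plays with the Console")
for part in parts:
    print(part)
```

Note that any split of this kind will also break legitimate mixed-case words (for example product names written in camel case), so the result may need a manual check.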

Power Query does not recognize tab as a delimiter in .txt files in the code

This is my first post here, so I apologize in advance if the question has already been answered somewhere or if I am doing something wrong. To summarize the problem:
I am doing some spectroscopy measurements and the data from the software I am using is saved in hundreds of .txt files. All files have the same content: first column refers to the wavelength, the second column is the intensity. Columns are separated from one another with a tab. The idea is to insert all of these .txt files in Power Query, rearrange the columns so there is only one column with the wavelength (since it is always the same for all measurements), and the remaining columns would be intensities (second column) of all inserted files.
Therefore, the desired output should look like this:
Wavelength (1st file), intensity (1st file), intensity (2nd file), intensity (3rd file),..., intensity (last file).
I found this brilliant solution, but it only works flawlessly if the columns are separated by a comma. I tried changing the code so that it recognizes the tab, but nothing I tried worked. I only found out about Power Query yesterday, so I am a total beginner here. Here is the code:
let
Source = Folder.Files("C:\Users\xxxxx\Desktop\new"),
// Standard UI; step renamed
FilteredTxt = Table.SelectRows(Source, each [Extension] = ".txt"),
// Standard UI; step renamed
RemovedColumns = Table.RemoveColumns(FilteredTxt,{"Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
// UI add custom column "FileContents" with formula Csv.Document([Content]); step renamed
AddedFileContents = Table.AddColumn(RemovedColumns, "FileContents", each Csv.Document([Content])),
// Standard UI; step renamed
RemovedBinaryContent = Table.RemoveColumns(AddedFileContents,{"Content"}),
// In the next 3 steps, temporary names for the new columns are created ("Column2", "Column3", etcetera)
// Standard UI: add custom Index column, start at 2, increment 1
#"Added Index" = Table.AddIndexColumn(RemovedBinaryContent, "Index", 2, 1),
// Standard UI: select Index column, Transform tab, Format, Add Prefix: "Column"
#"Added Prefix" = Table.TransformColumns(#"Added Index", {{"Index", each "Column" & Text.From(_, "en-US"), type text}}), //type text
// Standard UI:
#"Renamed Columns" = Table.RenameColumns(#"Added Prefix",{{"Index", "ColumnName"}}),
// Now we have the names for the new columns
// Advanced Editor: create a list with records with FileContents (tables) and ColumnNames (text) (1 list item (or record) per txt file in the folder)
// From this list, the resulting table will be build in the next step.
ListOfRecords = Table.ToRecords(#"Renamed Columns"),
// Advanced Editor: use List.Accumulate to build the table with all columns,
// starting with Column1 of the first file (Table.FromList(ListOfRecords{0}[FileContents][Column1], each {_}),)
// adding Column2 of each file for all items in ListOfRecords.
BuildTable = List.Accumulate(ListOfRecords,
Table.FromList(ListOfRecords{0}[FileContents][Column1], each{_}),
(TableSoFar,NewColumn) =>
Table.ExpandTableColumn(Table.NestedJoin(TableSoFar, "Column1", NewColumn[FileContents], "Column1", "Dummy", JoinKind.LeftOuter), "Dummy", {"Column2"}, {NewColumn[ColumnName]})),
#"Sorted Rows" = Table.Sort(BuildTable,{{"Column1", Order.Ascending}})
in
#"Sorted Rows"
//each {_}
//Splitter.SplitTextByWhitespace
This is the output I get when I run the code:
If I change the first five rows of the .txt files so that there is a comma between the columns, I get this:
The desired output (first five rows):
I was trying to change the each{_} in the Table.FromList line towards the end with the Splitter function, but it was not working.
I would be very grateful if you could take a look at the code, and suggest what needs to be changed in order for it to work.
Cheers!
Modify your code by inserting the #"Added Prefix2" step as shown below:
#"Added Prefix" = Table.TransformColumns(#"Added Index", {{"Index", each "Column" & Text.From(_, "en-US"), type text}}), //type text
#"Added Prefix2" = Table.TransformColumns(#"Added Prefix" , {{"FileContents", each Table.SplitColumn(_, "Column1", Splitter.SplitTextByEachDelimiter({"#(tab)"}, QuoteStyle.Csv, false), {"Column1", "Column2"})}}),
// Standard UI:
#"Renamed Columns" = Table.RenameColumns(#"Added Prefix2",{{"Index", "ColumnName"}}),
I prefer this version when I do something similar. It is more compact and preserves the file names of the source files:
let Source = Folder.Files("C:\directory\subdirectory"),
#"Filtered Rows" = Table.SelectRows(Source, each ([Extension] = ".txt")),
#"Added Custom1" = Table.AddColumn(#"Filtered Rows", "Custom", each Csv.Document(File.Contents([Folder Path]&"\"&[Name]),[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None])),
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom1", "Custom", {"Column1"}, {"Column1"}),
#"Split Column by Delimiter" = Table.SplitColumn(#"Expanded Custom", "Column1", Splitter.SplitTextByEachDelimiter({"#(tab)"}, QuoteStyle.Csv, false), {"Column1", "Column2"}),
#"Removed Other Columns1" = Table.SelectColumns(#"Split Column by Delimiter",{"Name", "Column1", "Column2"}),
#"Pivoted Column" = Table.Pivot(#"Removed Other Columns1", List.Distinct(#"Removed Other Columns1"[Name]), "Name", "Column2")
in #"Pivoted Column"
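As a cross-check of the reshaping logic (one wavelength column, then one intensity column per file), here is a small Python sketch. It is not a replacement for the M query: the function name merge_spectra and the in-memory sample data are assumptions made for this illustration, and it assumes numeric wavelengths with no header row.

```python
import csv
import io

def merge_spectra(files):
    """Combine two-column, tab-separated texts into one wide table.

    `files` maps a file name to its text content (a hypothetical input
    shape, chosen so the example runs without touching the disk).
    """
    names = list(files)
    table = {}  # wavelength -> {file name: intensity}
    for name, text in files.items():
        for row in csv.reader(io.StringIO(text), delimiter="\t"):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            wavelength, intensity = row
            table.setdefault(wavelength, {})[name] = intensity
    header = ["Wavelength"] + names
    body = [[w] + [table[w].get(n, "") for n in names]
            for w in sorted(table, key=float)]
    return [header] + body

sample = {
    "a.txt": "400\t1.0\n401\t1.5\n",
    "b.txt": "400\t2.0\n401\t2.5\n",
}
for line in merge_spectra(sample):
    print(line)
```

The key point is the same as in the M answers: the delimiter has to be the tab character itself, after which the per-file intensity columns can be lined up on the shared wavelength.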

Excel Web Powerquery: Excel merges data strings in cells --> How do I delimit the data?

I am using Excel 2016 and would like to download Odds from Oddschecker.com via the Web Powerquery function into an Excel Spreadsheet.
More specifically, I am trying to download the data from this Website:
https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history
The problem I have is that some odds on this Website are being merged without space between them into single cells:
Is there any way in Powerquery to delimit the data strings/odds so that they are not being merged?
Thank you very much in advance for any kind of help.
Another approach, in the code below, uses the recursive function fnSearchTR (embedded in the query) to drill down through the HTML document until an element named "TR" is found (or until 100 iterations have passed, just to prevent endless recursion). I noticed that this is where the required data is located, at least today.
Remark: I also adjusted the second step in the code to select the "Document".
This is a more dynamic solution, as it doesn't matter where in the document structure the "TR" is located. If the document structure changes, it is still possible that other "TR" elements are found first, but so far it works.
Other "TR" elements with different content are also found, but these are filtered out as errors or null values once the data type of the first column is set to date.
This query also uses the function "ExpandTables" from my previous answer (I corrected the typo by adding an "x"; otherwise the function is unchanged).
let
Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
Data0 = Table.SelectRows(Source, each [Caption] = "Document"){0}[Data],
ChildrenWithTable = Table.SelectRows(Data0, each [Children] is table),
fnSearchTR = (newChildren as table, counter as number) as table =>
let
Combined = Table.Buffer(Table.Combine(newChildren[Children])),
ChildrensChildrenWithTable = Table.AddColumn(newChildren, "ChildrensChildren", each Table.SelectRows([Children], each [Children] is table)),
ChildrensChildrenCombined = Table.Combine(ChildrensChildrenWithTable[ChildrensChildren]),
CombinedAll = if ChildrensChildrenCombined[Name]{0} = "TR"
then ChildrensChildrenCombined
else if Table.RowCount(ChildrensChildrenCombined) = 0 or counter = 100
then Combined
else @fnSearchTR(ChildrensChildrenCombined, counter + 1)
in
CombinedAll,
CombinedAll = if Table.RowCount(ChildrenWithTable) = 0 then Data0 else fnSearchTR(ChildrenWithTable, 0),
#"Filtered Rows" = Table.SelectRows(CombinedAll, each ([Name] = "TR")),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Children"}),
#"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "ExpandTables", each ExpandTables([Children])),
#"Removed Columns" = Table.RemoveColumns(#"Invoked Custom Function",{"Children"}),
#"Expanded ExpandTables" = Table.ExpandTableColumn(#"Removed Columns", "ExpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded ExpandTables",{{"Column1", type date}}),
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Changed Type", {"Column1"}),
#"Filtered Rows1" = Table.SelectRows(#"Removed Errors", each ([Column1] <> null))
in
#"Filtered Rows1"
Although I can't test it, since this site is blacklisted in the Russian Internet segment, I suppose there are <cr>s or <lf>s there, and they aren't transformed into new lines.
What you need is to run Text.Replace against all cells with data to replace these characters.
But then you'll probably need these values as separate rows, and this is a far more complex task. :)
Inspired by Gil Raviv's http://datachant.com/2017/03/30/web-scraping-power-bi-excel-power-query/
Edit April 11, 2017: this solution is highly dependent on the structure of the website, or in other words: yesterday it worked fine, but today it doesn't, unfortunately.
The following query, with its associated function, works for me:
let
Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
Data0 = Source{1}[Data],
Children = Data0{0}[Children],
Children1 = Children{1}[Children],
Children2 = Children1{4}[Children],
Children3 = Children2{0}[Children],
Children4 = Children3{0}[Children],
Children5 = Children4{0}[Children],
Children6 = Children5{3}[Children],
Children7 = Children6{0}[Children],
Children8 = Children7{1}[Children],
Children9 = Children8{3}[Children],
Children10 = Children9{0}[Children],
Children11 = Children10{2}[Children],
Children12 = Children11{2}[Children],
Children13 = Children12{0}[Children],
Children14 = Children13{1}[Children],
#"Removed Other Columns" = Table.SelectColumns(Children14,{"Children"}),
#"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "EpandTables", each ExpandTables([Children])),
#"Expanded EpandTables" = Table.ExpandTableColumn(#"Invoked Custom Function", "EpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded EpandTables",{"Children"}),
#"Removed Blank Rows" = Table.SelectRows(#"Removed Columns", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
#"Parsed Date" = Table.TransformColumns(#"Removed Blank Rows",{{"Column1", each Date.From(DateTimeZone.From(_)), type date}})
in
#"Parsed Date"
Function ExpandTables (edit: #"Added Custom" line adjusted by adding Table.SelectRows)
(ChildTable as table) =>
let
#"Removed Other Columns1" = Table.SelectColumns(ChildTable,{"Children"}),
#"Added Custom" = Table.AddColumn(#"Removed Other Columns1", "Custom", each try if [Children] is null then null else if [Children][Text]{0} <> null then [Children][Text]{0} else Lines.ToText(List.Transform(Table.SelectRows([Children], each [Children] <> null)[Children], each _[Text]{0})) otherwise null),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Children"}),
#"Transposed Table" = Table.Transpose(#"Removed Columns")
in
#"Transposed Table"
The problem is the HTML for one of the combined cells is:
<td><div class="oo">11/4</div><div class="oi">13/5</div><div class="oo">11/4</div></td>
As far as I know, div layout rules don't imply a newline, so Power Query doesn't insert one. We don't run a full layout engine, so we don't know that the column width means each div should be on its own line.
(If anybody knows more about HTML layout semantics, let me know and I can suggest a fix to my team.)
You can text-replace the HTML like this to inject your own delimiter (here ";") between the div elements:
let
WebPageWithReplace = (url as text, old as text, new as text) =>
let
Source = Web.Contents(url),
TextReplace = Text.ToBinary(Text.Replace(Text.FromBinary(Source), old, new)),
Page = Web.Page(TextReplace)
in
Page,
Invoked = WebPageWithReplace(
"https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history",
"</div><div",
"</div>;<div"),
Data = Invoked{1}[Data]
in
Data
And that way Web.Page will still find and parse the HTML table.
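To see the effect of the replacement in isolation, here is a short Python sketch applied to the cell HTML quoted above. The helper name cell_values is hypothetical, and the regex tag stripping is a crude shortcut that is only adequate for this small fragment, not a general HTML parser.

```python
import re

cell = '<td><div class="oo">11/4</div><div class="oi">13/5</div><div class="oo">11/4</div></td>'

def cell_values(html, delimiter=";"):
    """Return the text of each adjacent <div> in a table cell as a list."""
    # Same trick as the M query: put a delimiter between back-to-back divs.
    marked = html.replace("</div><div", "</div>" + delimiter + "<div")
    # Crude tag stripping, good enough for this fragment only.
    text = re.sub(r"<[^>]+>", "", marked)
    return text.split(delimiter)

print(cell_values(cell))
```

Without the replacement, stripping the tags yields the single merged string 11/413/511/4, which is the symptom described in the question.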

replicating a row with changing a field in Python

I have a large CSV file with two columns, like this:
Id and vehicle
I would like to replicate each row in which the vehicle is "truck", but with "car" in its place.
I have this code, but it raises an error
which says
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
What does it mean? Where am I going wrong?
infilename = r'external carriers.csv'
outfilename = r'outputCSV.csv'
with open(infilename, 'rb') as fp_in, open(outfilename, 'wb') as fp_out:
    reader = csv.reader(fp_in, delimiter=",")
    writer = csv.writer(fp_out, delimiter=",")
    for row in reader:
        if len(row) == 2:
            if row == "truck":
                writer.writerow = "car"
The error means the files were opened in binary mode ('rb' and 'wb'). In Python 3, csv.reader and csv.writer expect text mode, so use 'rt' and 'wt' (ideally with newline='', as the csv documentation recommends):
with open(infilename, 'rt', newline='') as fp_in, open(outfilename, 'wt', newline='') as fp_out:
Also, to check the vehicle type you need to compare row[1], which holds the vehicle name, rather than the whole row, and then write the modified row to the output file by calling writer.writerow(...) instead of assigning to it. The len(row) check is not a performance concern (len on a list is O(1)); you only need it if some lines may not have exactly two fields. If every row has two fields, you can simply unpack them in the for statement:
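import csv

infilename = r'external carriers.csv'
outfilename = r'outputCSV.csv'
# Text mode with newline='' is what the csv module expects in Python 3.
with open(infilename, 'rt', newline='') as fp_in, open(outfilename, 'wt', newline='') as fp_out:
    reader = csv.reader(fp_in, delimiter=",")
    writer = csv.writer(fp_out, delimiter=",")
    # Unpacking assumes every row has exactly two fields.
    for row1, row2 in reader:
        if row2 == "truck":
            # Only the converted truck rows are written; other rows are skipped.
            writer.writerow([row1, 'car'])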
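The loop above writes only the converted rows. If "replicate" is meant literally, so that every original row is kept and each truck row gets an extra "car" copy, a sketch could look like the following. This is one reading of the question, not something the asker confirmed; the generator name replicate_trucks and the sample ids are made up, and in-memory buffers stand in for the real files.

```python
import csv
import io

def replicate_trucks(rows):
    """Yield every row unchanged, plus a 'car' copy right after each 'truck' row."""
    for row in rows:
        yield row
        if len(row) == 2 and row[1] == "truck":
            yield [row[0], "car"]

# In-memory stand-in for the input file; the ids are made up.
source = io.StringIO("1,truck\n2,bus\n3,truck\n")
target = io.StringIO()
csv.writer(target, lineterminator="\n").writerows(replicate_trucks(csv.reader(source)))
print(target.getvalue())
```

With real files, the two StringIO objects would be replaced by files opened in text mode with newline=''.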

Visibility of report field dependent on the value of another field

I have two columns, one [CUS SKU] and the other [UPC].
I have two specific ids, 1234 and 1233, among many others, but I only care whether these two show up.
My problem:
If either of these two ids shows up in both columns, I only want to display one column and hide the other.
If another id is displayed in both columns, display both.
If another id shows up in either column and neither of the two important ids is shown, then display it in either of the columns.
Also, the two important ids will sometimes have 0 or 00 in front. How do I accommodate that as well?
This is what I tried in each column, but I had no luck; both displayed the same.
=IIF (Fields!CUS_SKU.Value = ("1234") or Fields!CUS_SKU.Value = ("1233") and Fields!UPC.Value = ("1234") or Fields!UPC.Value = ("1233"), True, False)
and
=IIF (Fields!CUS_SKU.Value <> ("1234") or Fields!CUS_SKU.Value <> ("1233") and Fields!UPC.Value = ("1234") or Fields!UPC.Value = ("1233"), False, true)
When mixing And and Or in a condition, you need to use parentheses carefully: And binds more tightly than Or, so your expression is evaluated as A Or (B And C) Or D. Try this:
=IIF ((Fields!CUS_SKU.Value = ("1234") or Fields!CUS_SKU.Value = ("1233")) and (Fields!UPC.Value = ("1234") or Fields!UPC.Value = ("1233")), True, False)
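The grouping issue, and the leading-zero part of the question that the expression above does not address, can be illustrated with a small Python sketch. This is not SSRS syntax; the names IMPORTANT_IDS, is_important and hide_column are invented for the example, and stripping leading zeros is only one possible way to normalise the ids.

```python
IMPORTANT_IDS = {"1234", "1233"}

def is_important(value):
    """True if value is one of the two ids, ignoring leading zeros such as 01234 or 001233."""
    return str(value).strip().lstrip("0") in IMPORTANT_IDS

def hide_column(cus_sku, upc):
    """Mirror of the parenthesised condition: hide only when BOTH fields hold an important id."""
    return is_important(cus_sku) and is_important(upc)

# Why the parentheses matter: `and` binds tighter than `or`, in Python as in VB.
a, b, c, d = True, False, False, False
ungrouped = a or b and c or d      # parsed as a or (b and c) or d
grouped = (a or b) and (c or d)
print(ungrouped, grouped)
```

In the report itself the same normalisation would have to be written as an SSRS expression or custom code; the sketch only shows the logic to aim for.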

SSRS customized pie chart color

I have a question.
I need to show a pie chart in SSRS for student results according to their status (Pass/Fail). There are only four combinations: Male-pass, Male-fail, Female-pass and Female-fail, and I need to show each one in my own color.
For this I am using the Switch condition as follows:
=Switch(
((Fields!Gender.Value = "Male")&(Fields!Status.Value="Pass")), "Blue",
((Fields!Gender.Value = "Male")&(Fields!Status.Value="Fail")), "HotPink",
((Fields!Gender.Value = "Female")&(Fields!Status.Value="Fail")), "Orange",
((Fields!Gender.Value = "Female")&(Fields!Status.Value="Pass")),"LimeGreen" )
But the preview shows only the default color set, not the customized one. Can anyone fix this? Thanks in advance.
Try using something like
=IIf((Fields!Gender.Value = "Male") And (Fields!Status.Value = "Pass"), "Green",
IIf((Fields!Gender.Value = "Male") And (Fields!Status.Value = "Fail"), "Red",
IIf((Fields!Gender.Value = "Female") And (Fields!Status.Value = "Fail"), "Blue",
IIf((Fields!Gender.Value = "Female") And (Fields!Status.Value = "Pass"), "Yellow",
"#00000000"))))
You should be able to get it working using the Switch statement as well. The problem with your expression is that the logical "and" operator in SSRS is And, not the ampersand. In SSRS, a single ampersand is used for concatenating strings, so your expression concatenates the string representations of the two boolean results, producing strings like TrueFalse. This should actually give an error when the Switch is evaluated.
A correct Switch statement would be this:
=Switch(
Fields!Gender.Value = "Male" And Fields!Status.Value="Pass", "Blue",
Fields!Gender.Value = "Male" And Fields!Status.Value="Fail", "HotPink",
Fields!Gender.Value = "Female" And Fields!Status.Value="Fail", "Orange",
Fields!Gender.Value = "Female" And Fields!Status.Value="Pass","LimeGreen"
, True, "SomeOtherColor"
)
I've also added an "else" part to the switch in case some records are not covered by the other conditions. If you're 100% sure that won't happen, you can remove the line that starts with "True". But it shouldn't hurt to keep it either.
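The "first true condition wins, with True as a catch-all" behaviour can be mimicked in a few lines of Python. This is only an illustration of the evaluation order; the helper names switch and slice_color are hypothetical, and returning None when nothing matches is this sketch's stand-in for an empty result.

```python
def switch(*pairs):
    """Return the value paired with the first true condition, or None if none match."""
    # Pairs arrive flattened: condition1, value1, condition2, value2, ...
    for condition, value in zip(pairs[0::2], pairs[1::2]):
        if condition:
            return value
    return None

def slice_color(gender, status):
    return switch(
        gender == "Male" and status == "Pass", "Blue",
        gender == "Male" and status == "Fail", "HotPink",
        gender == "Female" and status == "Fail", "Orange",
        gender == "Female" and status == "Pass", "LimeGreen",
        True, "SomeOtherColor",  # catch-all, like the True line in the expression
    )

print(slice_color("Female", "Pass"))
```

Because the catch-all condition is always true, it has to come last; anything placed after it would never be reached.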
More info: Pie Chart Techniques (look for Custom Coloring chapter)