I am attempting to clean up the following data which has been extracted from HTML.
Some sentences were not split correctly: the capitalised first word of one sentence is "stuck" to the last word of the preceding sentence.
The image below illustrates what I am trying to achieve:
So in essence, if there is a run of text like "The boy plays with the ballThe Girl plays with the Console", it would be split into:
The boy plays with the ball
The Girl plays with the Console
M code so far, with the actual data (it must be run in Power BI, as it uses the Html.Table function, which is not available in Excel):
let
Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://echa.europa.eu/registration-dossier/-/registered-dossier/14184/7/1"))}),
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Contains([Column1], "General Population - Hazard via oral route") then [Column1] else null),
#"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] <> null)),
#"Kept Last Rows" = Table.LastN(#"Filtered Rows", 1),
#"Removed Other Columns" = Table.SelectColumns(#"Kept Last Rows",{"Custom"}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("</dd><dt>", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter", "Text", each Html.Table([Custom], {{"Custom",":root"}})),
#"Expanded Text" = Table.ExpandTableColumn(#"Added Custom1", "Text", {"Custom"}, {"Custom.1"})
in
#"Expanded Text"
The image still looks incorrect ("informationOverall" is not split), but if you want to split by character transition, you can do so from the ribbon (Split Column > By Lowercase to Uppercase).
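To make the idea of a "character transition" split concrete outside Power Query, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name split_stuck_sentences is hypothetical, it assumes plain ASCII letters, and it relies on re.split accepting zero-width matches (Python 3.7 or later). It will also split legitimate mixed-case words, so treat it as a starting point rather than a finished cleaner.

```python
import re

def split_stuck_sentences(text):
    """Split wherever a lowercase letter is immediately followed by an uppercase one.

    Hypothetical helper for illustration; requires Python 3.7+ because the
    pattern matches an empty string between the two letters.
    """
    return re.split(r"(?<=[a-z])(?=[A-Z])", text)

parts = split_stuck_sentences(
    "The boy plays with the ballThe Girl plays with the Console"
)
for part in parts:
    print(part)
```

The lookbehind and lookahead consume no characters, so both letters around the split point are kept.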
I have a CSV file where 2 columns contain several different text values e.g.
Column 1: Reptiles, Health, Hygiene
Column 2: Purity
I need to use VBscript to split these columns into a new CSV file without changing the current file, expected output in new CSV file shown below:
Column 1 Column 2
Reptiles Reptiles
Health Health
Hygiene Hygiene
Purity Purity
Unfortunately(?) it must be done with VB Script and nothing else.
Here is an example of how the data looks (the data repeats consistently, with some extra entries, through the same columns in file 1).
And here is an example of how it needs to look. It needs to repeat down until every unique entry from Column 1 and Column 2 in the original file has been written as a single entry to Column 1 in the new file and copied to Column 2 in the same file, e.g.
Examples in text format as requested:
Original file:
Column 1,Column 2
"Reptiles, Health, Hygiene",Purity
New File:
Column 1,Column 2
Reptiles,Reptiles
Health,Health
Hygiene,Hygiene
Purity,Purity
I think this is a simple matter of using the FileSystemObject with Split function.
Assuming each input line is just one set of data you can remove the double quotes and process from there
Try this VB script out (edited to process header line separately):
Const Overwrite = True
Set ObjFso = CreateObject("Scripting.FileSystemObject")
Set ObjOutFile = ObjFso.CreateTextFile("My New File Path", Overwrite)
Set ObjInFile = ObjFso.OpenTextFile("My Old File Path")
' Skip processing first header line and just write it out as is
strLine = ObjInFile.ReadLine
ObjOutFile.WriteLine strLine
Do Until ObjInFile.AtEndOfStream
    ' Remove all double quotes to treat this as one set of data
    strLine = Replace(ObjInFile.ReadLine, """", "")
    varData = Split(strLine, ",")
    ' Write out each element twice into its own line
    For i = 0 To UBound(varData)
        ' Trim drops the space left after each comma (e.g. " Health")
        ObjOutFile.WriteLine Trim(varData(i)) & "," & Trim(varData(i))
    Next ' VBScript does not accept a counter variable after Next
Loop
ObjInFile.Close
ObjOutFile.Close
I am using Excel 2016 and would like to download odds from Oddschecker.com into an Excel spreadsheet via the web Power Query function.
More specifically, I am trying to download the data from this Website:
https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history
The problem I have is that some odds on this Website are being merged without space between them into single cells:
Is there any way in Power Query to delimit the data strings/odds so that they are not merged?
Thank you very much in advance for any kind of help.
Another approach, in the code below, uses the recursive function fnSearchTR (embedded in the query) to drill down the HTML document until the name "TR" is found, or until 100 iterations have passed, to prevent endless recursion. I noticed that this is where the required data is located, at least today.
Remark: I also adjusted the second step in the code to select the "Document".
This is a more dynamic solution, as it doesn't matter at which depth of the document structure the "TR" is located. However, if the document structure changes, other "TR"s may be found first; so far it works.
"TR"s with other content are also found, but these are filtered out as errors or null values after the data type of the first column is changed to date.
This query also uses the function "ExpandTables" from my previous answer (I corrected the typo and added an "x", otherwise no changes in the function).
let
Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
Data0 = Table.SelectRows(Source, each [Caption] = "Document"){0}[Data],
ChildrenWithTable = Table.SelectRows(Data0, each [Children] is table),
fnSearchTR = (newChildren as table, counter as number) as table =>
let
Combined = Table.Buffer(Table.Combine(newChildren[Children])),
ChildrensChildrenWithTable = Table.AddColumn(newChildren, "ChildrensChildren", each Table.SelectRows([Children], each [Children] is table)),
ChildrensChildrenCombined = Table.Combine(ChildrensChildrenWithTable[ChildrensChildren]),
CombinedAll = if ChildrensChildrenCombined[Name]{0} = "TR"
then ChildrensChildrenCombined
else if Table.RowCount(ChildrensChildrenCombined) = 0 or counter = 100
then Combined
else @fnSearchTR(ChildrensChildrenCombined, counter + 1)
in
CombinedAll,
CombinedAll = if Table.RowCount(ChildrenWithTable) = 0 then Data0 else fnSearchTR(ChildrenWithTable, 0),
#"Filtered Rows" = Table.SelectRows(CombinedAll, each ([Name] = "TR")),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Children"}),
#"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "ExpandTables", each ExpandTables([Children])),
#"Removed Columns" = Table.RemoveColumns(#"Invoked Custom Function",{"Children"}),
#"Expanded ExpandTables" = Table.ExpandTableColumn(#"Removed Columns", "ExpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded ExpandTables",{{"Column1", type date}}),
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Changed Type", {"Column1"}),
#"Filtered Rows1" = Table.SelectRows(#"Removed Errors", each ([Column1] <> null))
in
#"Filtered Rows1"
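The level-by-level descent that fnSearchTR performs can be sketched in Python as follows. This is a loose illustration, not a translation of the M code: the function name find_level_with and the dictionary layout (with "name" and "children" keys standing in for the Name and Children columns) are assumptions made for the example, and it uses a loop with a depth cap instead of recursion.

```python
def find_level_with(nodes, name="TR", max_depth=100):
    """Descend one level at a time until a level's first node is called `name`.

    `nodes` is a list of dicts with "name" and "children" keys (a hypothetical
    stand-in for the Name/Children columns). Returns the first level whose
    first node matches, or the last non-empty level reached if nothing
    matched within `max_depth` levels.
    """
    level = nodes
    for _ in range(max_depth):
        # Combine the children of every node on the current level
        next_level = [child for node in level for child in node.get("children", [])]
        if not next_level:
            return level          # nothing deeper to inspect
        if next_level[0]["name"] == name:
            return next_level     # found the level that starts with `name`
        level = next_level
    return level                  # depth cap reached

tree = [{"name": "HTML", "children": [
    {"name": "BODY", "children": [
        {"name": "TABLE", "children": [
            {"name": "TR", "children": []},
            {"name": "TR", "children": []},
        ]},
    ]},
]}]
found = find_level_with(tree)
print([node["name"] for node in found])
```

As in the query, the depth cap only guards against runaway descent; it says nothing about whether the rows found are the ones you want, which is why the later filtering steps are still needed.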
Although I can't test it, since this site is blacklisted in the Russian segment of the Internet, I suppose there are <cr>s or <lf>s there that aren't being transformed into new lines.
What you need is to run Text.Replace against all cells with data to replace these characters.
But then you'll probably need these values as separate rows, and this is a far more complex task. :)
Inspired by Gil Raviv's http://datachant.com/2017/03/30/web-scraping-power-bi-excel-power-query/
Edit April 11, 2017: this solution is highly dependent on the structure of the website, or in other words: yesterday it worked fine, but today it doesn't, unfortunately.
The following query with its associated function works for me:
let
Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
Data0 = Source{1}[Data],
Children = Data0{0}[Children],
Children1 = Children{1}[Children],
Children2 = Children1{4}[Children],
Children3 = Children2{0}[Children],
Children4 = Children3{0}[Children],
Children5 = Children4{0}[Children],
Children6 = Children5{3}[Children],
Children7 = Children6{0}[Children],
Children8 = Children7{1}[Children],
Children9 = Children8{3}[Children],
Children10 = Children9{0}[Children],
Children11 = Children10{2}[Children],
Children12 = Children11{2}[Children],
Children13 = Children12{0}[Children],
Children14 = Children13{1}[Children],
#"Removed Other Columns" = Table.SelectColumns(Children14,{"Children"}),
#"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "EpandTables", each ExpandTables([Children])),
#"Expanded EpandTables" = Table.ExpandTableColumn(#"Invoked Custom Function", "EpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded EpandTables",{"Children"}),
#"Removed Blank Rows" = Table.SelectRows(#"Removed Columns", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
#"Parsed Date" = Table.TransformColumns(#"Removed Blank Rows",{{"Column1", each Date.From(DateTimeZone.From(_)), type date}})
in
#"Parsed Date"
Function ExpandTables (edit: #"Added Custom" line adjusted by adding Table.SelectRows)
(ChildTable as table) =>
let
#"Removed Other Columns1" = Table.SelectColumns(ChildTable,{"Children"}),
#"Added Custom" = Table.AddColumn(#"Removed Other Columns1", "Custom", each try if [Children] is null then null else if [Children][Text]{0} <> null then [Children][Text]{0} else Lines.ToText(List.Transform(Table.SelectRows([Children], each [Children] <> null)[Children], each _[Text]{0})) otherwise null),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Children"}),
#"Transposed Table" = Table.Transpose(#"Removed Columns")
in
#"Transposed Table"
The problem is the HTML for one of the combined cells is:
<td><div class="oo">11/4</div><div class="oi">13/5</div><div class="oo">11/4</div></td>
As far as I know, div layout rules don't imply a newline, so Power Query doesn't insert one. We don't run a full layout engine, so we don't know that the column width means each div should be on its own line.
(If anybody knows more about HTML layout semantics, let me know and I can suggest a fix to my team.)
You can text-replace the HTML like this to inject your own delimiter (;) between the div elements:
let
WebPageWithReplace = (url as text, old as text, new as text) =>
let
Source = Web.Contents(url),
TextReplace = Text.ToBinary(Text.Replace(Text.FromBinary(Source), old, new)),
Page = Web.Page(TextReplace)
in
Page,
Invoked = WebPageWithReplace(
"https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history",
"</div><div",
"</div>;<div"),
Data = Invoked{1}[Data]
in
Data
And that way Web.Page will still find and parse the HTML table.
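The effect of the replacement can be reproduced outside Power Query with a small Python sketch that uses only the standard library. It is an illustration under assumptions: the CellText class and cell_texts helper are hypothetical names, and the parser only mimics "take the text of each td", not Web.Page itself.

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Collect the text content of every <td>, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data

def cell_texts(html):
    """Return the text of each table cell in `html` (hypothetical helper)."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return parser.cells

html = ('<td><div class="oo">11/4</div><div class="oi">13/5</div>'
        '<div class="oo">11/4</div></td>')

merged = cell_texts(html)                                         # divs run together
patched = cell_texts(html.replace("</div><div", "</div>;<div"))   # delimiter injected
odds = patched[0].split(";")
print(merged, odds)
```

Without the replacement the three odds collapse into one string; with it, the injected semicolon survives as text between the div elements and can be split on afterwards.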
I have a large CSV file with two columns like this:
Id and vehicle
I would like to copy the rows where the vehicle is "truck", but write "car" instead.
I have this code, but it fails with an error
which says
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
What does it mean? Where am I going wrong?
infilename = r'external carriers.csv'
outfilename = r'outputCSV.csv'
with open(infilename, 'rb') as fp_in, open(outfilename, 'wb') as fp_out:
    reader = csv.reader(fp_in, delimiter=",")
    writer = csv.writer(fp_out, delimiter=",")
    for row in reader:
        if len(row) == 2:
            if row == "truck":
                writer.writerow = "car"
You opened the files in binary mode ('rb' and 'wb'), but in Python 3 the csv module expects files opened in text mode. Use 'rt' and 'wt' instead (the csv documentation also recommends passing newline=''):
with open(infilename, 'rt', newline='') as fp_in, open(outfilename, 'wt', newline='') as fp_out:
Also, to check the vehicle type you need to compare row[1], which holds the vehicle name, rather than the whole row, and then write a new row to the output file; assigning to writer.writerow does not write anything. Checking len(row) is cheap (len on a list is O(1)), but it is only needed if some rows might not have exactly two fields. The version below unpacks each row directly and will raise a ValueError on such rows.
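import csv

infilename = r'external carriers.csv'
outfilename = r'outputCSV.csv'
# Python 3: the csv module needs text-mode files; newline='' lets it handle line endings itself
with open(infilename, 'rt', newline='') as fp_in, open(outfilename, 'wt', newline='') as fp_out:
    reader = csv.reader(fp_in, delimiter=",")
    writer = csv.writer(fp_out, delimiter=",")
    # Each row must have exactly two fields: the id and the vehicle
    for row_id, vehicle in reader:
        if vehicle == "truck":
            # Write the truck rows out again with the vehicle changed to "car"
            writer.writerow([row_id, 'car'])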
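If the intent is instead to keep every row and only swap the vehicle name where it is "truck", a variant could look like the sketch below. This reading of the question is an assumption, as is the helper name replace_vehicle; the example uses in-memory io.StringIO objects instead of real files so it can be run as is, and it skips blank or malformed lines rather than failing on them.

```python
import csv
import io

def replace_vehicle(src, dst, old="truck", new="car"):
    """Copy every two-column row from src to dst, swapping `old` for `new`.

    Hypothetical helper: `src` and `dst` are text-mode file-like objects.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        row_id, vehicle = row
        writer.writerow([row_id, new if vehicle == old else vehicle])

sample = io.StringIO("1,truck\n2,bike\n\n3,truck\n")
out = io.StringIO()
replace_vehicle(sample, out)
result = out.getvalue()
print(result)
```

With real files, pass the objects returned by open(..., 'rt', newline='') and open(..., 'wt', newline='') in place of the StringIO instances.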