
Nobody is born a data analyst, and most people are indifferent to columns of digits. But if you discover an inclination for finding evidence hidden in numbers, you will need proper tools to make it happen.

Your first choice will probably be Microsoft Excel. And if you have a clue about programming, you can also rely on languages like R or Python. But what if you're not technologically skilled, but would like to use something more advanced than a spreadsheet?

Today, I'd like to introduce you to Microsoft Power BI Desktop (the 'BI' stands for 'Business Intelligence'), which connects these two worlds - it has an interface not too distant from Excel's, but it also lets you design complex data loading processes and enhance the built-in transformation features with snippets of real code.

This tool, designed for advanced data analysis, is available for free. Microsoft also sells the Pro edition with advanced workgroup features, but we won't need it today.

Installation

The easiest way to get Power BI is through this link: https://aka.ms/pbidesktopstore

It will redirect you to the Microsoft Store app. If you prefer the standalone installer, it is available here: https://aka.ms/pbiSingleInstaller


[Link]

Working With BI Tools

The typical flow of working with data looks as follows: first, you load the data; then you fix data types and column formatting; then you remove invalid rows and clean the data; finally, you perform the analysis on the prepared data set.

In this article, we'll analyze the logs of a website running on WordPress. Apache web server logs were chosen deliberately, since we want to show Power BI's superiority over Excel - working with such a format in a spreadsheet would be tough.

A single line of the file looks as follows:

159.69.121.51 - - [25/Jan/2021:18:05:33 -0800] "GET /kanarek-czyli-system-poznego-ostrzegania/ HTTP/1.1" 200 22310 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.1; Win64; x64; Trident/5.0)"

Here is a description of the successive fields (more information in the Apache documentation):

159.69.121.51 - IP address of the client host
- - - two unused fields (client identity and user ID), each replaced with a minus sign
[25/Jan/2021:18:05:33 -0800] - date and time of the request
"GET /kanarek-czyli-system-poznego-ostrzegania/ HTTP/1.1" - the request line (method, URL and protocol version)
200 - HTTP response code
22310 - number of transmitted bytes
"-" - referer
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.1; Win64; x64; Trident/5.0)" - field identifying the user's browser (the user agent)
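This fixed structure is exactly what Power BI will tease apart for us in the following sections. Outside Power BI, the same parsing could be sketched in a few lines of Python with a regular expression - the pattern below is my own sketch of the Apache 'combined' format, not part of the article's workflow:

```python
import re

# A sketch of the Apache "combined" log format; real logs can vary slightly
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<userid>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<length>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"'
)

line = ('159.69.121.51 - - [25/Jan/2021:18:05:33 -0800] '
        '"GET /kanarek-czyli-system-poznego-ostrzegania/ HTTP/1.1" '
        '200 22310 "-" '
        '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.1; Win64; x64; Trident/5.0)"')

fields = LOG_PATTERN.match(line).groupdict()
print(fields['ip'])      # 159.69.121.51
print(fields['status'])  # 200
```

In Power BI we won't need any of this - the Query Editor will do the splitting for us with point-and-click steps.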

We load the data with the command File → Get Data → Text/CSV. Afterwards, the following dialog window appears:

[Link]

Up to this stage, the import window looks very similar to the one in Excel:

[Link]

However, where Excel's import processing ends, Power BI begins to shine. Click the 'Transform data' button, and the following Query Editor will appear:

[Link]

Transforming Input Data

In the Query Editor, we'll make all the necessary transformations. Every step becomes a separate entry in the 'Applied Steps' pane, so we can experiment freely - we can always revert to an earlier transformation step.

I recommend switching to the 'View' tab and enabling the 'Column quality' and 'Column distribution' options - they will make any potential problems easier to spot. Enabling 'Formula bar' will help ensure that a given transformation step indeed does what we intend it to do.

[Link]

We'll start with a few basic operations - let's delete columns 2, 3 and 11 (right-click the column header, command 'Remove'). Then we'll merge columns 4 and 5 (highlight the two columns while holding the Shift key, right-click the header and choose 'Merge columns'). Now we can highlight columns 7-8 and change the data type to numeric (right-click the header, 'Change type' → 'Decimal number').

Next, it's time for a more advanced transformation - we'll change the column holding the date and time of the request to a proper temporal type. Note that the web server was located in California, which is confirmed by the UTC-8 timezone offset.

First, we'll use the 'Replace Values' function twice, in order to remove the left and right square brackets (replacing '[' with an empty value makes it disappear). However, if we try to change the column type to 'Date/Time/Timezone' immediately after that, the whole column will be filled with errors. After a few tries, I discovered that the date must be separated from the time with a space, not a colon. Therefore, the third replacement should change '2021:' to '2021 '.

[Link]
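As a cross-check of what this timestamp actually represents, here is the same conversion sketched in plain Python (not something Power BI needs - just a way to verify the offset is what we think it is):

```python
from datetime import datetime, timezone

raw = '[25/Jan/2021:18:05:33 -0800]'
# Equivalent of the two 'Replace Values' steps: strip the square brackets
cleaned = raw.strip('[]')
# Python's strptime accepts Apache's colon between date and time directly;
# Power BI's 'Date/Time/Timezone' parser wants a space there instead
ts = datetime.strptime(cleaned, '%d/%b/%Y:%H:%M:%S %z')
print(ts.isoformat())                           # 2021-01-25T18:05:33-08:00
print(ts.astimezone(timezone.utc).isoformat())  # 2021-01-26T02:05:33+00:00
```

Note that 18:05 in California is already 02:05 the next day in UTC - worth remembering when the converted column later shows your local time.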

To finally achieve the 'Date/Time' type, we need to switch the column to 'Date/Time/Timezone' first, and only after that to 'Date/Time' (necessarily as a new step). Why such complications? Well, the graphical user interface hides some undocumented assumptions, not all of which are immediately obvious. But in the end, we reached our goal. We can also see that the mentioned column now displays time in our local timezone.

As the next transformation step, we'll extract some additional variables from the HTTP request column. Right-click the header and choose 'Split columns' → 'By delimiter'. Parameters: 'Space' in the combo box, left-most delimiter, accept - we have now extracted the request type (e.g. GET). Let's do it again, this time with the right-most space delimiter - we now also have the protocol version (e.g. HTTP/1.1). Finally, the third split uses the '?' character (question mark) as a right-most delimiter - this way the URL parameters end up in a separate column.
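The three splits can be mimicked in a few lines of Python; the query string in the sample request below is my own invention for illustration:

```python
request = 'GET /kanarek-czyli-system-poznego-ostrzegania/?utm_source=rss HTTP/1.1'

# Split at the left-most space: request type vs. the rest
method, rest = request.split(' ', 1)
# Split at the right-most space: URL (with parameters) vs. protocol version
url_with_params, protocol = rest.rsplit(' ', 1)
# Split at the right-most '?': URL vs. query parameters
url, sep, parameters = url_with_params.rpartition('?')
if not sep:  # no '?' at all - the whole string is the URL
    url, parameters = url_with_params, ''

print(method)      # GET
print(url)         # /kanarek-czyli-system-poznego-ostrzegania/
print(parameters)  # utm_source=rss
print(protocol)    # HTTP/1.1
```

The left-most/right-most distinction matters: URLs may contain further spaces or question marks, and splitting from the correct side keeps them intact.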

That will be enough for now. The last thing to do is changing the column names to more user-friendly ones. From left to right: ip, timestamp, request, url, parameter, protocol, status, length, referer (do not correct the typo - it has been present in the HTTP spec for 25 years) and useragent.

Now we can end working with Query Editor by clicking 'Close & Apply' in the 'Home' ribbon tab. You can always go back to that window with 'Transform data' on the main Power BI window.

Cleaning The Data

In the final analysis, we'd like to learn something about the readers' behaviour - but the web server logs contain too much garbage. We need to filter out requests performed by WordPress automation, search engines, RSS readers and so on. We'll switch between the Report view and the Data view to determine the required set of filters. Of course, we'll also benefit from some knowledge of the communication protocol used by the browser while talking to the web server.

Let's start by adding any bar chart from the 'Visualizations' pane. A chart template will appear in the working area. Let's highlight it, then drag 'request' from the 'Fields' pane to the 'Axis' and 'Values' properties. We should see something like this:

[Link]

We can see that POST requests make up a third of the whole set, which is kind of surprising. Let's switch to the Data view (the middle 'grid' icon in the 3-icon set on the left side of the screen). Click the filter icon in the header of the 'request' column and allow only POST values.

[Link]

It's obvious that most POST requests are issued by WordPress internals. We only want to analyze GET requests, the ones used in 'normal' web browsing. Switch back to the Report view (the top icon of the three mentioned), then drag the 'request' field to the 'Filters on all pages' frame in the 'Filters' pane. After that, limit the visibility to GET requests.

[Link]

We still want to filter out web-crawling bots. How can we do that? Check out the 'useragent' column - it holds the browser identification string sent by browsers and also by bots. Let's place the Table visual in the working area and drag the 'useragent' field there.

[Link]

In that column, we have over 600 distinct values. Let's change the visualization type to a bar chart and sort it by the number of occurrences. You will find the sorting options under the three dots visible after selecting the given visual.

[Link]

The Google bot (the 2nd item) visited this web page over 1,300 times during one day. How do we deal with such log entries? Let's build another column (field) holding the operating system of the user's computer. Anything we don't recognize will be considered a bot.

Let's go back to Query Editor (button 'Transform data'), select tab 'Add column' and click 'Conditional Column' button. We will define a new column as follows:

[Link]
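The exact rules live in the dialog above; as a rough Python equivalent, the logic looks like the sketch below (the matched substrings are my illustrative guesses, not a copy of the screenshot):

```python
def detect_os(useragent: str) -> str:
    """Mimic the conditional column: the first matching rule wins,
    and anything unrecognized is considered a bot."""
    rules = [
        ('Windows', 'Windows'),
        ('Android', 'Android'),  # before 'Linux' - Android UAs also contain 'Linux'
        ('iPhone', 'iOS'),
        ('iPad', 'iOS'),
        ('Mac OS X', 'macOS'),
        ('Linux', 'Linux'),
    ]
    for needle, os_name in rules:
        if needle in useragent:
            return os_name
    return 'bot'

print(detect_os('Mozilla/5.0 (Windows NT 10.0; Win64; x64)'))  # Windows
print(detect_os('Mozilla/5.0 (Linux; Android 10; SM-G960F)'))  # Android
print(detect_os('Mozilla/5.0 (compatible; Googlebot/2.1)'))    # bot
```

Just like in the dialog, rule order matters - the Android check must come before the generic Linux one, because Android user agents also contain the word 'Linux'.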

Remember - if you make a mistake, you can always go back to this window by clicking the small cog icon near the last item of the 'Applied steps' pane. Now, let's save the changes and go back to the main window to check the ratio of the newly calculated values.

[Link]

It turns out we identified bots correctly - a manual inspection revealed dozens of RSS readers, web crawling robots and various mechanisms responsible for preparing link previews in social media and instant messengers.

We need to speed up a little bit - we'll also eliminate requests with a response code other than 200, and filter out JavaScript and CSS files, as well as images and multimedia. Unfortunately, we can spot some inconsistencies between different parts of Power BI. For example, filtering by text works differently in the Data view (left picture) and the Report view (right picture). The filter option 'ends with' is available only in the former.

[Link]
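Conceptually, the combined filter boils down to something like the following Python sketch; the sample rows and the suffix list are made up for illustration:

```python
# A handful of made-up rows standing in for the log table
rows = [
    {'url': '/kanarek-czyli-system-poznego-ostrzegania/', 'status': 200},
    {'url': '/wp-content/themes/twentytwenty/style.css',  'status': 200},
    {'url': '/wp-includes/js/jquery/jquery.min.js',       'status': 200},
    {'url': '/wp-content/uploads/2021/01/zdjecie.jpg',    'status': 200},
    {'url': '/nieistniejaca-strona/',                     'status': 404},
]

# Keep only successful responses that aren't static assets
STATIC_SUFFIXES = ('.css', '.js', '.png', '.jpg', '.jpeg', '.gif', '.mp4', '.ico')
pages = [r for r in rows
         if r['status'] == 200
         and not r['url'].lower().endswith(STATIC_SUFFIXES)]

print([r['url'] for r in pages])  # ['/kanarek-czyli-system-poznego-ostrzegania/']
```

This is the spirit of the 'ends with' filter: a single page view drags along dozens of asset requests, and only the page itself should count as a visit.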

To create a dashboard, we'll also need:

  • A field keeping the hour of the request (0-23), which we will create as a 'Custom column' with the formula 'Time.Hour(DateTime.Time([timestamp]))'
  • A grouping in the column describing operating systems, as below (desktop and mobile browser groups)

[Link]
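For readers who prefer code to dialogs, both preparations above can be sketched in Python (the group membership is my assumption - adjust it to whatever the grouping dialog defines):

```python
from datetime import datetime

# Hour of the request - equivalent of the M formula
# Time.Hour(DateTime.Time([timestamp]))
timestamp = datetime(2021, 1, 25, 18, 5, 33)
hourofday = timestamp.hour
print(hourofday)  # 18

# Grouping operating systems into desktop/mobile buckets
OS_GROUPS = {
    'Windows': 'desktop', 'macOS': 'desktop', 'Linux': 'desktop',
    'Android': 'mobile',  'iOS': 'mobile',
}
print(OS_GROUPS.get('Android', 'other'))  # mobile
print(OS_GROUPS.get('bot', 'other'))      # other
```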

Visualization

The main advantage of Power BI over Excel is the possibility of doing interactive analysis right after the dashboard is created. Power BI allows you to instantly filter and highlight values and visualize dependencies between variables. Let's create a report with the following visuals:

  • Slicer, fields 'os' and 'os (groups)'
  • Table, field 'protocol'
  • 100% stacked column chart, axis 'hourofday', legend 'os (groups)', values 'os TODO'
  • Line chart, axis 'hourofday', legend 'os (groups)', values 'os TODO'
  • Stacked bar chart, axis 'url', values 'number TODO', sort descending by values

If you want to learn how many people read the article about emoticons between 20:00 and 21:00 using a mobile phone, you only need three clicks:

[Link]

The target value is presented in the tooltip when the user points at the proper bar - out of 34 visits, the mentioned article was loaded 4 times.

Clicking parts of visuals automatically filters or highlights the other visuals. You can achieve a similar effect in recent versions of Excel using a Slicer or a Pivot table, but the filter will be a separate visual; you can't enable or disable it by clicking a selected element of the chart.

Of course, the described behavior is optional - using 'Format' → 'Edit interactions', you can selectively enable and disable these relationships. It is possible to build a dashboard with two isolated sets of visuals and, as a result, be able to compare two different filters side by side.

Summary

In this article, I presented how to use Microsoft Power BI Desktop to analyze web server logs. Working on raw logs was rather laborious, but luckily, the resulting dashboard can be re-used (to load another data set, simply use 'Transform data' → 'Data source settings' → 'Change source…').

Yet, it's important to remember that although BI tools are more powerful than Excel, and are better at transforming and adjusting data, for some uses they're still not enough.

If that is the case in your project, you'll have to learn programming - but you can also… call PGS Software. We'll gladly provide you with a turn-key solution tailored specifically to your needs.

And did I mention that with Machine Learning, you're not only able to explain past events, but can also… predict the future? To learn more, book a free, no-strings-attached consultation with me or one of our experts, and we'll tell you more about it!


Disclaimer

PGS Software SA published this content on 17 March 2021 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 14 April 2021 07:40:03 UTC.