2022 Semester 1 Fortnight 5 – Data & Networks

Learning Goals

Humans are visual creatures. Data visualisations let non-specialists grasp the story the data is telling, which makes them a powerful and effective tool for presenting and analysing data. One of the projects I worked on at my current workplace required us to engineer customer and service data for the visualisation team to present through Tableau. Tableau not only presents the data but also lets the viewer interact with the graphs, charts and plots.

One of the most important skills for a practitioner of the new branch of engineering is the ability to translate question-asking skills across different contexts and media. I need to be able to frame the questions I am asking of the data in a way that can be explored, investigated, and interrogated using data tools. I look for three things:

  1. Patterns in the data, revealed by mapping and graphing
  2. Gaps in the data
  3. Outliers

The chosen tool is Tableau, which is widely used in industry. The goal here is to become sufficiently comfortable with Tableau, ask better questions of the available data, and tell a coherent and useful story to stakeholders, most of whom will be non-specialists who do not know much about data analytics. These stakeholders could be potential customers of the firm’s service who need to be sold on the changes its delivery could make, government officials who need to make sound policy decisions, or senior decision-makers who need to make high-risk decisions within a short timeframe. Effective data visualisation can therefore help an organisation minimise cost and time and make evidence-based decisions.

In fortnight 2, while working on the Tic-Tac-Toe assignment, I discovered the artist Refik Anadol, who has turned machine learning and data visualisation into a genre of art. This really opened my eyes to the possibilities of data wrangling and visualisation. The story and message his work tells its audience has a different character from data analytics in corporate environments. I thought implementing this kind of generative art on the screen of my maker project, the Two-Way Mirror, would be a great idea, but wiring up the outdated, multi-language magic mirror libraries was time-consuming enough on its own, so I did not pursue it. Still, it is a brilliant way of bringing complex data into a cyber-physical system.

For a start, I followed the videos in this Colab notebook: https://colab.research.google.com/drive/11nmEUYW338-oclVUZM1s2LNOZgbHSbNy?usp=sharing. This was helpful on top of the in-class tutorials, as I could watch them at my own pace.

Tableau is designed as a software as a service (SaaS) application, which means that instead of ‘opening’ or ‘importing’ data, Tableau asks you to ‘Connect’ data to its engine.

In the video, Johan mentioned that a good first step after importing data is to look and see if there are data points you might expect to see. How might you find out what a data set ‘should’ contain?

  • How are you transforming the data? How might others have transformed the data?
  • How inscrutable is the data you are working with? (That is, how hard is it to understand where it comes from, and whether the data has been transformed before?)
  • How reliable are your findings/transformations?

Wikipedia explains that data wrangling is the process of transforming data from one format into another so that it is “more appropriate and valuable” for further use. What does that mean? And who decides what is more appropriate and valuable? When did data wrangling even become a ‘thing’?

In the Date Time column, even though I clicked the sort-by-date button, Tableau is not treating the column as dates because its data type is string.
Notice that to change the data type of the Date Time column, the user needs to go to the Sheet 1 tab.
Going back to the Data source tab and sorting the Date Time column again, it is now correctly sorted from the latest date to the oldest.
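Outside Tableau, the same fix can be sketched in pandas. This is a minimal sketch with a hypothetical column name and sample values; the real dataset’s format may differ:

```python
import pandas as pd

# Hypothetical sample with dates stored as strings
df = pd.DataFrame(
    {"Date Time": ["2019-01-15 09:30", "2022-05-01 10:00", "2020-12-03 14:00"]}
)

# As strings, the column sorts lexicographically, not chronologically.
# Parsing it into real datetimes fixes the sort order.
df["Date Time"] = pd.to_datetime(df["Date Time"])

# Latest to oldest, as in the Tableau Data source tab
df = df.sort_values("Date Time", ascending=False).reset_index(drop=True)
```

The key step is `pd.to_datetime`; once the dtype is datetime, every downstream sort and resample treats the values as dates.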
Notice the GPS column. These GPS coordinates do not change over time. However, to plot them in Tableau, we need to split the GPS column into Latitude and Longitude columns. The standard way of writing GPS coordinates is Lat – Long; if you are unsure, look up the location on Google Maps to check.
Tableau makes this separation quite easy too, with the transform button you can see in the above screenshot.
As you can see above, the data types of GPS – Lat and GPS – Long are still string. Now I will change them to decimal numbers by clicking on the Abc button and selecting Number (decimal).
Having Lat and Long as decimal numbers is good, but Latitude and Longitude are actually data types of their own, so we select these under Geographic Role. Tableau really makes an effort to keep data transformations easy and seamless.
Now you can see a little globe icon next to each of GPS – Lat and GPS – Long. You can also observe that for columns such as Aqi Co, Tableau inferred they are numbers.
Going back to the Data source tab, you can tell that GPS – Lat and GPS – Long were created by the user because they do not have the blue bars at the top, and by the little ‘=’ sign before the data type (globe) icon. This is how Tableau distinguishes original columns from computed columns.
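The split-and-convert steps above can be sketched in pandas as well. The “Lat, Long” string format and the column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical GPS strings in the standard "Lat, Long" order
df = pd.DataFrame({"GPS": ["-35.2809, 149.1300", "-35.3082, 149.1244"]})

# Split the single string column into two, then convert the
# string halves to decimal numbers (float handles the spaces)
df[["GPS - Lat", "GPS - Long"]] = (
    df["GPS"].str.split(",", expand=True).astype(float)
)
```

This mirrors Tableau’s transform-split followed by the Abc-to-Number (decimal) change; pandas has no equivalent of the Geographic Role step, which is a Tableau-specific annotation.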
I am examining the Pm2.5 data by dragging it into the Rows section in Sheet 1. But as you can see, since the figure shown is the sum of all Pm2.5 values, it does not make much sense at all. I will therefore break the Pm2.5 data down over time as follows.
To break the data down over time, I dragged Date to the Columns section above. Still, this graph does not make much sense: it is just displaying the sum of Pm2.5 values per year. Now I will turn the Pm2.5 value into its average and divide the years into months.
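The sum-to-monthly-average change can be sketched in pandas too. The readings and column names here are hypothetical stand-ins for the real air-quality data:

```python
import pandas as pd

# Hypothetical PM2.5 readings; the real data spans Jan 2019 - May 2022
df = pd.DataFrame({
    "Date Time": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-02"]),
    "Aqi Pm2.5": [10.0, 14.0, 8.0],
})

# Average PM2.5 per calendar month, rather than an uninformative sum
monthly_avg = (
    df.set_index("Date Time")["Aqi Pm2.5"]
      .resample("MS")   # one bin per month, labelled at month start
      .mean()
)
```

Switching the aggregate from SUM to AVG is what makes the series comparable across months with different numbers of readings.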
I have changed the mark shape of the graph to Pie in the Marks dropdown to see the outliers. As you can see above, the outlier was caused by the Canberra bushfires.
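The bushfire spike stands out by eye, but outliers can also be flagged numerically. A minimal interquartile-range (IQR) sketch, using hypothetical monthly averages:

```python
import pandas as pd

# Hypothetical monthly average PM2.5 values with one bushfire-like spike
monthly_avg = pd.Series([8.0, 10.0, 9.0, 11.0, 95.0, 10.0])

# Flag points more than 1.5 * IQR outside the middle 50% of the data
q1, q3 = monthly_avg.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = monthly_avg[
    (monthly_avg < q1 - 1.5 * iqr) | (monthly_avg > q3 + 1.5 * iqr)
]
```

The 1.5 multiplier is the conventional box-plot rule; a domain expert would still need to judge whether a flagged point is an error or a real event like the bushfires.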
Since Pm2.5 is not the only air pollutant, I am going to examine the level of carbon monoxide in the same way. To examine and make sense of the above pattern, I am now going to vary the colours of the dots by the location where the data was collected. We know from the data source that the data were collected in three different locations.
I can do this simply by dragging the Name column, which holds the names of the locations, onto the Colour button under the Marks section.
Above is the animated visualisation of the average Pm2.5 level in three Canberra suburbs, by month, from January 2019 to May 2022.
I learned that I can display a map simply by placing GPS – Long onto Columns and GPS – Lat onto Rows.

Then I can drag any column of the data onto the square buttons in the Marks section.

Note that if you want both the labels of AVG(Aqi Pm2.5) and its colour varied, you need to drag AVG(Aqi Pm2.5) onto the Marks section twice, once for each. Then you can change the colour coding and its range to signify how severe each level is. In this process, however, the data visualiser must understand the subject matter thoroughly to gauge and represent the appropriate level of severity.
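Mapping average readings onto severity bands, as the colour range does, can be sketched with `pd.cut`. The thresholds below are illustrative only, not official AQI breakpoints, which is exactly the subject-matter knowledge the visualiser needs:

```python
import pandas as pd

# Hypothetical average PM2.5 values to classify
avg_pm25 = pd.Series([5.0, 20.0, 60.0, 150.0])

# Illustrative severity bands; a real analysis must use the
# official AQI breakpoints for PM2.5
severity = pd.cut(
    avg_pm25,
    bins=[0, 12, 35, 55, float("inf")],
    labels=["Good", "Moderate", "Poor", "Hazardous"],
)
```

Encoding severity as an ordered categorical like this makes the colour legend explicit and auditable, instead of leaving the thresholds implicit in a colour slider.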

Questions for the new branch of engineering

  • What are the important responsibilities of a data analyst/data engineer?
  • How well does a data analyst need to understand the subject matter of the data they are dealing with?
  • What is the difference between data profiling and data mining?
  • What is the KNN imputation method and how does it work?
  • What is the best way to deal with missing or suspect data technically, and how do you best communicate it?
  • What are the different data validation methods, and in which cases is each best suited?
  • How do you best detect and deal with outliers technically, and how do you best communicate them?