You are currently in guest mode. Log in or register to fully use the HOOU platform.
Your advantages when registering:
- You can use chat and forums to exchange ideas.
- We will confirm your participation in some courses.
QuaXP - Data Quality Explored
Topic outline
-
In this section, you will learn about the structure and usage of the QuaXP - Data Quality Explored project. The section consists of a tutorial that provides information on the course and how to navigate the book. The book is divided into different parts, including an introduction, chapters on numerical, image, and text data, as well as sections on Python tutorials and additional content. The course offers both beginner and advanced levels, with the advanced level including code cells for running and modifying code. Quizzes and practical tasks are also included throughout the course.
-
Welcome to the project QuaXP - Data Quality Explored
You find yourself on the tutorial page. Here you can learn everything about the course and how to use the book.
Good luck!
The book is accessible here: https://www3.tuhh.de/sts/hoou/data-quality-explored/
Tutorial: how to use the book?
The course is supposed to be worked with in autonomy: there is no way of keeping track of someone's progress and results. If you wish to keep your scores, you should write them somewhere locally.
On this page, we present in different sections, the parts the book contains and how they should be used, the two different difficulty levels (beginner and advanced), and their content.
The structure of the book and its pages
The book is to be followed like a classical paper book, with pages, a table of contents, and chapters. At the end of each page, a button allows the user to go on to the next page. It is also possible to use the menu on the left to explore a section and jump to any page:
edu sharing object
Each page starts with a banner containing information about it, such as the title, the goal and the time needed to study the page:
edu sharing object
On the right of the page, a table of contents for the said page is shown, for details about the page itself. By clicking on a title, is is also possible to jump to a specific section.
The different parts of the book
This book is divided in different parts. To fully understand how to work with it and which pages to visit, here is an overview on how each part is meant to be integrated in the learning experience.
The parts are accessible with the menu on the left:
The introduction
The introduction contains some introductory content that can help to understand the course.
If you never heard of machine learning before, and never tried to visualize some data on a graph, we highly recommend you to read the introduction.
This section is meant for both beginner and advanced levels, the introduction to Python targetting only the advanced learners.
What is machine learning?
There, you can learn what "machine learning" means and how it is used in the course. This page is important to understand what the main steps in a machine learning experiment are. It is a good introduction to understanding the place of the data in the whole process.
Supervised learning used in this course
On this page, we dig a little bit more in the details of the specific type of machine learning we use here. You learn how to measure the performance of a machine learning experiment, and how the data is specified depending on its type.
Main types of graphs used in this course
We present here different ways we will use in the course to visualize the data or the results of our experiments. It follows the previous pages as you need to understand what the data and the results mean before being able to apprehend them on a graph.
Specifically, we present histograms, boxplots, and classic 2-d plots of one variable against another.
.csv files - data format used in the course
This short page shows the format used for data files in the course. It is not necessary for the beginner level, but it can be useful to have an idea on how the data look like. For the advanced level, the knowledge of .csv files is necessary.
Get started with Python
This section offers 7 pages showing basic tutorials about Python. If you only want to follow the beginner level, you can skip this part. It is a good starting point for learners who want to follow the advanced level and have never worked with Python or with the Pandas library. Have a look at the different sections and identify what you already know and what you do not.
The appendix
This part is only for curious minds, it contains additional content that is not necessary to understand the course. We present how the data have been collected and transformed before being used in the course, and how the machine learning algorithms that we choose work.
The chapters
The course contains 3 chapters. Each chapter contains several sections that are entirely part of the course, with a quiz at the end of each page.
At the end of the chapter, a practical task is proposed for the advanced level.
Chapter 1: numerical data
This chapter studies how to understand, gauge and deal with data quality for numerical data, meaning, datasets containing numbers. In our case, we took the specific case of AIS data, data sent by a ship along its journey at sea (speed, location, time). We propose several small experiments to understand how a change in the data affects the result of a machine learning experiment.
Chapter 2: image data
We dig into the world of images. How images are treated and understood by algorithms in a machine learning experiment, and how we can modify an image dataset to modify the results of our prediction. We chose a simple problem of object classification for this chapter: we want to build a model that can classify the object present on an image (for example: "is this animal a horse or a cat?").
Chapter 3: text data
In this chapter, we explore how text data can be preprocessed and cleaned and how words can be used by a machine learning algorithm. The example task we use is a sentiment analysis, or opinion mining, using tweets from Twitter. Sentiment analysis gives an idea on how a concept is perceived by the public from tweets, or other texts like reviews, classifying them into positive or negative categories according to their contents.
Beginner level
The beginner level aims to teach data quality to learners with no background in coding or computer science. The course follows the same path as for the advanced level, without the code cells and code instructions.
Prerequisites and target audience
Some logic thinking is still needed to understand the beginner level, as we discuss about data, statistics and graphs. But everything is explained in time.
The beginner level targets every curious mind who would come accross this teaching offer and would like to learn more about machine learning and data. Curiosity is necessary, as it is not an easy topic, but we aim to make it as accessible as it can be.
Summary videos
Summary videos at the beginning of each section aim to give a first glance of the addressed topics in the part. After watching those videos, you should be ready to start with the course itself.
Quizzes
Inside the course, quizzes are proposed at the end of each page to test whether the page was well understood. Thoses quizzes are a way of self-learning: feedback is given for each question but nothing is mandatory.
Visualizations
The visualizations (graphs, images, ...) for the beginner level are the same as for the advanced level. Most visualizations are static but sometimes, an interactive widget allows you to play around with parameters to see how the results are affected. Don't be shy with those widgets, even if they should display an error: there is no error that can make your computer crash.
A jump to the next level?
For the most brave learners, there is always the possibility to shift from beginner to advanced. The only difference between the two levels is additional information about code and the possibility to run code cells.
A tutorial to get started with Python is available in the introduction. Why not give it a try?
Advanced level
The only difference between the beginner and the advanced level is the addition of code cells and code instructions.
To see those cells, click on the button "click to show" on the right:
Code cells
The code cells integrated in the pages are the most important feature for the advanced level. They allow you to run and modify code directly in the book. Don't be shy to play around and modify them as much as you want: even if you should get an error, nothing is fatal for the book. Creating errors can also be part of the fun.
Tu run the cells, click on the rocket logo on the top right of the page (this logo only appears on pages that contain code cells).
edu sharing objectThen, two options are possible:
- Select "Live Code". This allows you to run and modify the code cells directly on the page. It may take some time to get ready, so I suggest you start it as soon as you are on the page.
- Select “Binder”. This will redirect you to another page: the page of the book will be open with Jupyter Notebook in your browser. You might see some hidden content in this version, do not pay attention to those cells (the hidden cells are marked as “cell to hide”). At the end of the page, you need to return manually to the book to see the menu and switch pages.
IMPORTANT: within one page, the code cells have to be run in a top-down order. If a code cell uses variables or libraries that were assigned in a previous code cell, and this previous code cell has not be run, you will get an error.
When you leave a page, all the progress is lost and if you come back on this page, you have to re-run the code cells.
Summary videos
Those videos at the beginning of each section give a summary of the addressed topics in the part.
For advanced learners, they can be used in several ways: either to test whether or not you already know what the chapter deals with, in that case, you can try to answer the quizzes directly and do the practical task to test yourself first. They can also help to remember the main points listed in each part, for example after having worked with the book, to refresh the knowledge, etc...
Quizzes
At the end of each page, quizzes about the content of the page are proposed. Those quizzes do not contain any coding questions and are more thought of as a test and a refreshment about the concepts of the course.
Practical tasks
At the end of each chapter, a practical task is proposed. There, you can apply on new data and examples the coding concepts you learned during the course.
-