QuaXP - Data Quality Explored
A basic understanding of data science helps you comprehend how automated decisions are made. This accessible online course on machine learning is offered at different difficulty levels. You will learn how to detect and fix errors in data, with a focus on numerical, image, and text data. The course uses Jupyter Book as an interactive platform and requires no programming knowledge from beginners, while advanced learners can experiment with code. The course is developed by the Institute for Software Systems (STS) at TUHH.
About the course
Did you know that most hiring processes in big companies are supported by machine learning algorithms that pre-select the best resumes before handing them to a human? (1) That some banks use machine learning to predict whether or not you will be able to repay a loan? (2) That most online customer support services start the interaction with an intelligent agent rather than a human? (3) Because we are confronted with so many of these applications in our daily lives, a basic knowledge of data science is essential to understand how decisions are made and what we, as members of society, can do to maximize the efficiency and minimize the risks of these processes.
That is why public awareness of computer science and technology is of great importance, and science enthusiasts should not be the only ones informed about these challenges and pitfalls. Education should offer learning opportunities to a broad audience by creating resources that are understandable with little to no scientific background. With that in mind, we want to design an online course about machine learning that is accessible to everybody, with different levels of difficulty.
Is it for me?
This course is for anyone interested in learning more about how data is used in today's technologies.
Because the course is available at two different levels, beginners in science and programming are more than welcome: the only prerequisites are curiosity about the topic and the ability to understand basic logic and science. No coding is required: you can play around with the data using only sliders and visual interfaces.
For those who like a little more challenge, there is the advanced level: the content is the same as for the beginner level, except that the manipulations are done through code. For this level, you need some basic programming skills, but don't worry: the introduction provides some tutorials for beginners who are interested in coding. In general, Python is an easy-to-learn programming language, and within the scope of this course we use it at a light level.
No prior knowledge of machine learning is required, as we also provide an introduction to the topic. The course focuses on data quality, so the machine learning tasks are kept very simple.
Topics covered by the course
You will learn how to detect some errors in data, how they can affect the task the data is used for, and how to "repair" them or mitigate the consequences of poor data quality.
As data quality can vary greatly depending on the type of data, we chose to focus on three different types of data:
- numerical data: data containing mostly numerical values. This is the simplest type of data to process; in this course, it is logistics data (such as AIS data).
- image data: a representation of the pixels constituting an image.
- text data: information expressed in natural language (sentences), for which evaluating data quality is a harder task.
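To give a feel for what detecting and "repairing" errors can mean for numerical data, here is a minimal Python sketch. It is not taken from the course material: the speed values, the plausibility range, and the function names are all assumptions made up for this illustration. It flags readings that are missing or outside an assumed range, then replaces them with the median of the valid readings, which is only one of several possible repair strategies.

```python
from statistics import median

# Hypothetical vessel speeds in knots, loosely modelled on AIS-style records.
# None marks a missing reading; 102.3 is implausibly high for a ship.
speeds = [12.1, 11.8, None, 12.4, 102.3, 12.0]

# Plausibility range assumed for this illustration only.
MIN_SPEED, MAX_SPEED = 0.0, 40.0


def is_valid(value):
    """Return True if the reading is present and inside the assumed range."""
    return value is not None and MIN_SPEED <= value <= MAX_SPEED


def detect_errors(values):
    """Return the positions of missing or out-of-range readings."""
    return [i for i, v in enumerate(values) if not is_valid(v)]


def repair(values):
    """Replace every flagged reading with the median of the valid ones."""
    valid = [v for v in values if is_valid(v)]
    fill = median(valid)
    return [v if is_valid(v) else fill for v in values]


error_positions = detect_errors(speeds)  # positions 2 and 4 are flagged
repaired = repair(speeds)

print("Flagged positions:", error_positions)
print("Repaired readings:", repaired)
```

Filling with the median keeps the repaired value close to typical readings without letting the implausible value influence it. Whether such a repair is appropriate depends on what the data is later used for.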
Design of the learning arrangement
The course is developed using Jupyter Book, which provides an interactive interface, mixing text content with code cells.
For the beginner level, the code cells are hidden and the interaction happens through text, graphs, sliders, and text boxes. Quizzes appear regularly so that you can make sure you understand what you are reading. Everything is online and you do not need to install anything on your personal computer.
The advanced level uses the interface of Jupyter Book to its fullest: the code cells can be modified, run and re-run directly on the page. Here again, everything happens online. However, if you want to do your own experiments, it is also possible to download the content and the datasets to your personal machine.
Prerequisites
For the beginner level, no prerequisite is needed besides curiosity. Of course, without being curious about the topic, it will be hard to follow! But with the appropriate attention, the course is accessible to everyone. In the introduction, we added some content to help you prepare for the course, in case you are missing some basics in science. Good luck!
For the advanced level, it is theoretically possible to also start from scratch, as we offer some Python tutorials in the introduction. However, some background in computer science and coding is helpful for understanding the code presented in the class.
Context of the project
The project is developed by the Institute for Software Systems (STS) at the TUHH, under the supervision of Prof. Sibylle Schupp.
The STS institute specializes in the following research fields: model checking and abstract interpretation, software quality and verification, and data protection and machine learning.
The collaborators of QuaXP previously worked on a related project: MaLiTuP (Machine Learning in Theory and Practice), a collaboration between the Institute of Maritime Logistics (MLS) and the STS institute, which successfully developed a beginner course in machine learning for students in maritime logistics. With this project, we gained experience in teaching with Python and Jupyter Notebook, as well as in designing lectures and tasks for beginners in machine learning.