Processing Big Data

Doctoral Program: PhD in Electrical and Computer Eng.
Lecturer: Cláudia Soares

This page provides support materials (presentations, code, data) produced by students enrolled in IST's PhD course "Processing Big Data". As part of the course requirements, students must apply data processing techniques to a specific problem, present their solution in a 10-minute "spotlight" talk, and publish a 6-page article peer-reviewed by their classmates. Depending on disclosure agreements (some data is subject to NDAs), their code and data will be made available for public use.

This is an early "experiment" in how to leverage "AI-on-demand" platforms to enhance the production of AI tools, to learn, and to work collaboratively in a "distributed" community. Through this initiative we aim to test how platforms such as AI4EU can help bridge university research and society at large, particularly in the scattered European scientific ecosystem.
As an example, pollution data was shared by Norwegian partners who collaborate with the physical AI Task. Other data was provided by local companies or "scraped", but final results may become available on the AI4EU platform.

PBD's discussion group: https://www.ai4eu.eu/group/bigdata-processing (free registration).


Date: Thursday, July 16, 2020, 10:00 AM CET (9:00 AM Lisbon)
Zoom Video conference: https://videoconf-colibri.zoom.us/j/4591173089

Program

Presentations are  grouped in three thematic areas:
Health, Air Quality-Pollution, Society

Time in CET timezone = (Lisbon + Athens)/2

Health

10:00

Pedro Valdeira - Overcrowding in emergency departments: analysis and forecasting

Emergency departments see variations in patient volume, which often lead to overcrowding, a major healthcare issue. In this paper, a dataset of waiting times and the number of patients awaiting emergency care is analyzed. Knowing the patient volume in advance is of great use: it can be taken into account when planning the allocation of resources, leading to better service.

 presentation code  data

10:15

B. Tavares, D. Teixeira - Evaluation of Drug Consumption Risk

Drug use is a matter of social concern, and several risk factors may be associated with an increased probability of drug consumption. Assessing an individual's risk of drug consumption and its relationship to personal traits is therefore very important in physical and mental healthcare. This project evaluates that relationship in a dataset covering several drugs, using exploratory analysis, resampling techniques, and classification methods. One main conclusion is that a high N-score, a low A-score, and a low C-score are the personality traits most commonly associated with drug use.

 presentation code data

10:30

R. Duarte -  Analysis of the CUF primary-specialty care referrals

In this paper we analyse the CUF primary-specialty care referrals. We present a descriptive analysis of the bipartite graph of referrals and build two recommender-system models that recommend doctors of a given specialty to patients. The final objective is to analyse whether the patterns produced by the recommender systems are similar to the present referral patterns.

 presentation code data

Air Quality and Pollution Analysis

11:00

J. Torres - Trondheim Air Quality and Traffic

Air quality is a subject of major concern in an industrialized and motorized world. During the Processing Big Data course, air quality and pollutant levels were dissected and subjected to a thorough analysis. One measure of air quality, developed by the United States Environmental Protection Agency (EPA), is the air quality index (AQI). By calculating the AQI for different pollutant readings, air quality in the city of Trondheim, Norway, has been assessed. The derived AQI measure and traffic are analyzed to provide better insight into the underlying processes generating it. Furthermore, we used a Gaussian Process Regressor (GPR) and tested its ability to predict the AQI through space and time at various points in the city.
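As an illustration of how an AQI value is derived from a pollutant reading, here is a minimal sketch of the EPA's piecewise-linear interpolation for a single pollutant (PM2.5, 24-hour average). It is not taken from the author's code: the function name `pm25_aqi` is hypothetical, and the breakpoint table is assumed to be the EPA PM2.5 table in force in 2020.

```python
import math

# Assumed PM2.5 24-hour breakpoints (µg/m³) and their AQI ranges:
# (conc_low, conc_high, aqi_low, aqi_high)
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_aqi(concentration):
    """Return the AQI sub-index for a PM2.5 concentration in µg/m³."""
    # Truncate to one decimal place; the small epsilon guards against
    # floating-point values such as 120.99999999 when scaling by 10.
    c = math.floor(concentration * 10 + 1e-9) / 10
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            # Linear interpolation inside the matching breakpoint interval.
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    raise ValueError(f"concentration {concentration} is outside the AQI table")
```

When several pollutants are measured, the reported AQI is the highest of the individual sub-indices.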

 presentation code data

11:15

D. Vicente -  An Air Pollution Prediction Pipeline using Pollutant Data Gathered in Lisbon, 2019

We study the viability of applying matrix completion methods to high-dimensional matrices before they serve as training data for Gaussian process regression models. The regression model can then be used together with pollution data to predict pollutant concentrations at a considerable physical distance from monitoring stations, predict values in time, or both.
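To make the idea of matrix completion concrete, the sketch below fills the missing entries of a small matrix with a rank-1 fit obtained by alternating least squares. This is a toy stand-in, not the method used in the paper: the function name `complete_rank1` and the sample `readings` matrix are hypothetical, and the code assumes every row and column has at least one observed entry.

```python
def complete_rank1(matrix, n_iter=200):
    """Fill None entries of `matrix` with a rank-1 fit u[i] * v[j]."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iter):
        # Fix v and solve a least-squares problem for each row factor u[i],
        # using only the observed entries of that row.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            u[i] = (sum(matrix[i][j] * v[j] for j in obs)
                    / sum(v[j] ** 2 for j in obs))
        # Fix u and solve for each column factor v[j] in the same way.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            v[j] = (sum(matrix[i][j] * u[i] for i in obs)
                    / sum(u[i] ** 2 for i in obs))
    # Keep observed values; use the fitted product only where data is missing.
    return [
        [matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical example: rows could be stations, columns time steps.
readings = [
    [1.0, 2.0, None],
    [None, 4.0, 8.0],
    [3.0, None, 12.0],
]
filled = complete_rank1(readings)
```

In a pipeline like the one described, the completed matrix would then be passed on as training data to the regression model.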

 presentation code data

11:30

M. Almeida, R. Santos - Trondheim Traffic and Pollution Analysis

In recent years, air quality has become an environmental health issue due to rapid urbanization and industrialization. Given its impact on everyday life, predicting air quality precisely has become an urgent and essential problem. Considering the evidence of a relationship between traffic and pollution, the goal of our study is to predict air pollution based on traffic values. The data presented was collected in Trondheim, Norway.

 presentation code data

11:45

A. Gonçalves, M. Vieira, F. Caldas - Air quality in Europe and COVID-19

Since the COVID-19 epidemic broke out, the world has changed in many ways. The pandemic has led most governments to declare states of emergency, leading to a halt in production and in the transportation of people and goods. This change has global implications, some of them somewhat unforeseen, such as the decrease in pollution, which has been an insurmountable issue for a long time.

 presentation code data

Society

12:00

M. Motta - Mobility Mining – An introductory Twitter-based experience for Lisbon, Portugal

This paper addresses an alternative methodology, based on machine learning over social media content, to gather and process data for mobility studies. An automatic feature-set generation method is presented to segment users by socioeconomic status (SES) and to identify users' activity patterns based on reverse geocoding.

 presentation code data

12:15

L. Serra - Integrating Energy, Technology and Institutions

The objective of this research is to contribute to a model that provides scientific evidence for the conceptual framework for industrial revolutions (FIR), which presents Energy, Technology and Institutions as pillars of societal development and as the keystones for explaining economic growth.

 presentation code data

12:30

R. Taban - Dealing with Imbalanced Data: the case of marketing scoring models

In marketing, finding potential buyers plays an important role in the sales cycle. The process of finding potential buyers is called Lead Scoring (LS). In this project, we received a large dataset and a request to perform LS on it, which turns into an imbalanced binary classification problem. This article focuses on building the best possible predictive model and settings for the provided dataset. The results show that Random Forest, combined with the Synthetic Minority Over-Sampling Technique (SMOTE) as a balancing technique and Mutual Information Feature Selection (MIFS) as the feature selection approach, delivers the best predictive model.
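The core idea of SMOTE can be sketched in a few lines: each synthetic sample is a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. The code below is a simplified illustration, not the author's implementation or a library API; the function name `smote`, its parameters, and the sample points are all hypothetical.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate `n_new` synthetic points from a list of minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance,
        # excluding x itself.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random position between x and the chosen neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=5, k=2)
```

The synthetic points would be appended to the training set only, so that the classifier is still evaluated on the original class distribution.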

 presentation code data