DATATHON

Online event

28-06-2021 to 30-06-2021
"Athena" Research Center



Join here on 28 June at 15:30 to watch the opening of the datathon and ask your questions.
On Wednesday 30 June at 13:30, watch here the contestants' presentations and the award ceremony for the winners.

Datathon Award Ceremony

The main challenge of this Datathon is to apply innovative data analysis methods to the question of how fake news spreads through social networks, to use modern data "mining" tools and mechanisms for scientific fields and/or publications, and to use innovative interactive visualization methods.

During the two days of the Datathon, participants will compete using methods and tools while working on the following challenges/problems:
 
  1. Fake news diffusion in Twitter
  2. Field/Journal-based Scientific Impact
  3. Visual Analytics for Mobility Data
The results are expected to include solutions that address a wide range of topics within the above challenges, such as:
  • Fake news data from the Greek Hoaxes website, matched against Twitter data, to extract tweet propagation graphs that reveal fake news
  • Fake news propagation graphs on Twitter, used to extract patterns for identifying fake tweets
  • Ranking of publications/journals by impact, using data/metadata and taking the scientific field into account
  • Proposals for new impact indicators for journals/publications
  • Visual Analytics for Traffic Prediction
  • Visual Analytics for "dirty" data
If you are a team of scientists, researchers, or students from various disciplines, especially computer science, library science, or communication and media/journalism, join this innovative Datathon to apply your knowledge, experiment in new areas of knowledge, and acquire new skills! Significant prizes await the winners!
 
 

Challenges

Challenge #1 - Fake news diffusion in Twitter
Category
Fake news, Social networks
Short Description
The target of this challenge is to show how fake news propagates through social networks. Fake news data from the GreekHoaxes website will be matched against Twitter data to extract tweet propagation graphs that demonstrate the fake news diffusion.
(Sub-)tasks
An important challenge in current research is the identification of fake news, as it appears in digital media. Several methods have been proposed that are based on machine learning techniques and use various characteristics of the text to identify fake content. Such fake content usually propagates in social media like Twitter, as users retweet the content of the original tweet to their followers. Some of the proposed methods use the Twitter propagation graphs of fake news to extract patterns that are subsequently used to identify fake tweets that follow similar diffusion patterns.
In this challenge the objective is to produce a Twitter diffusion graph for a fake news story.
 
The data that will be provided is:
  • A fake news story mentioned in the GreekHoaxes (https://www.ellinikahoaxes.gr/) website.
  • A corpus of tweets downloaded from Twitter in JSON format that contains tweets relevant to the given fake news story.
  • A sample format of the desired propagation graph output.
The desired outcome is a graph whose nodes correspond to Twitter users, and edges indicate retweets of tweets relevant to the given fake news story.
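
As a minimal sketch of the final step only (turning an already-filtered tweet corpus into a graph), the Python below assumes the tweets follow the Twitter API v1.1 JSON layout, in which a retweet carries a `retweeted_status` object and every tweet has a `user.screen_name` field. The function name, the edge direction (original author to retweeter), and the sample records are illustrative assumptions, not part of the challenge specification.

```python
from collections import defaultdict


def build_retweet_graph(tweets):
    """Build a user-level retweet graph from a list of tweet dicts.

    Returns (nodes, edges): nodes is a set of screen names, and edges maps
    (original_author, retweeter) to the number of retweets between them.
    """
    nodes = set()
    edges = defaultdict(int)
    for tweet in tweets:
        user = tweet["user"]["screen_name"]
        nodes.add(user)
        original = tweet.get("retweeted_status")
        if original is None:
            continue  # an original tweet adds a node but no edge
        author = original["user"]["screen_name"]
        nodes.add(author)
        edges[(author, user)] += 1
    return nodes, dict(edges)


# Hypothetical sample records, reduced to the fields the sketch reads.
sample_tweets = [
    {"id": 1, "user": {"screen_name": "alice"}},
    {"id": 2, "user": {"screen_name": "bob"},
     "retweeted_status": {"id": 1, "user": {"screen_name": "alice"}}},
    {"id": 3, "user": {"screen_name": "carol"},
     "retweeted_status": {"id": 1, "user": {"screen_name": "alice"}}},
]

if __name__ == "__main__":
    nodes, edges = build_retweet_graph(sample_tweets)
    for (author, retweeter), count in sorted(edges.items()):
        print(f"{author} -> {retweeter} ({count})")
```

In this layout `retweeted_status` refers to the original tweet, so each retweeter is linked directly to the original author; recovering intermediate hops would need extra information such as follower relations.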
 
Participants are expected to provide a detailed step-by-step roadmap that describes how to produce the propagation graph from the given input. This roadmap may include a mixture of pseudocode, code, a selection of tools with instructions on how to use them, diagrams, and whatever else is needed to specify the solution completely and accurately. Using this roadmap, a coder should be in a position to implement the solution without having to make any technical design decisions.
 
Note. The roadmap should address the general case, where the fake news story is not known in advance. In this general case, fake news stories are scraped from a website such as GreekHoaxes and for each story a diffusion graph is produced.
Potential Considerations
[1] Agichtein, E., Castillo, C., Donato, D., Gionis, A., & Mishne, G. (2008). Finding high-quality content in social media. WSDM.
[2] Abel, F., Hauff, C., Houben, G., Stronkman, R., & Tao, K. (2012). Semantics + filtering + search = twitcident. exploring information in social web streams. HT.
[3] MacEachren, A.M., Jaiswal, A.R., Robinson, A.C., Pezanowski, S., Savelyev, A., Mitra, P., Zhang, X., & Blanford, J.I. (2011). SensePlace2: GeoTwitter analytics support for situational awareness. 2011 IEEE Conference on Visual Analytics Science and Technology (VAST), 181-190.
Recommended technologies and tools
The participants could consider the following tools:
  • Python
  • Spark/PySpark
  • HDFS
  • Twitter API
  • MongoDB
  • Apache Solr

This challenge is supported by the projects "Moving from Big Data Management to Data Science" (MIS 5002437/3) and VisualFacts (#1614), funded by the Hellenic Foundation for Research and Innovation - 1st Call of Research Projects for the support of post-doctoral researchers.

Challenge #2 - Scientific Impact Measures
Category
Citation Networks, Scientometrics
Short Description
The target of this challenge is to produce datasets that contain scientific impact data for scientific venues or papers, focusing on specific challenges. Publication data from Crossref will be combined with BiP! DB data to calculate the proposed scientific impact measures.
(Sub-)tasks
Due to the constant increase in published scientific papers, an important challenge in current research is to identify those with the highest impact. Several methods to rank papers, as well as authors and journals, based on scientific impact have been proposed in the literature, most of them focusing on citation counts. However, relying on citation counts has inherent drawbacks: for example, citation patterns differ between scientific fields, which makes it difficult to compare papers across fields. To this end, a number of field-normalized indicators of impact have been proposed, such as the Field Weighted Citation Index (FWCI). However, further drawbacks remain, e.g., the impact of recent publications with few or no citations cannot be assessed reliably. Additionally, a number of impact indicators based on the venue of publication have been proposed, such as the Impact Factor or the Eigenfactor. These indicators have in turn been criticized for being heavily influenced by a few highly cited papers in each venue.
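
To make the idea of field normalization concrete, here is a minimal Python sketch that divides each paper's citation count by the mean citation count of papers in the same field and publication year. It is a simplified stand-in, not the published FWCI definition, which also conditions on document type and a citation window. The record keys (`doi`, `field`, `year`, `citations`) and the sample values are hypothetical.

```python
from collections import defaultdict


def field_normalized_scores(papers):
    """Return {doi: citations / mean citations of the same field and year}."""
    totals = defaultdict(lambda: [0, 0])  # (field, year) -> [citation sum, paper count]
    for paper in papers:
        bucket = totals[(paper["field"], paper["year"])]
        bucket[0] += paper["citations"]
        bucket[1] += 1

    scores = {}
    for paper in papers:
        citation_sum, count = totals[(paper["field"], paper["year"])]
        expected = citation_sum / count
        # If no paper in the bucket is cited, the ratio is undefined: report None.
        scores[paper["doi"]] = paper["citations"] / expected if expected > 0 else None
    return scores


# Hypothetical records: both fields have the same relative spread of citations.
sample_papers = [
    {"doi": "math-1", "field": "mathematics", "year": 2019, "citations": 2},
    {"doi": "math-2", "field": "mathematics", "year": 2019, "citations": 6},
    {"doi": "bio-1", "field": "biology", "year": 2019, "citations": 50},
    {"doi": "bio-2", "field": "biology", "year": 2019, "citations": 150},
]

if __name__ == "__main__":
    for doi, score in field_normalized_scores(sample_papers).items():
        print(doi, score)
```

With this normalization a score of 1 means "cited as often as the average paper of the same field and year", so papers from fields with very different citation habits become comparable.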
Distinct aims of this challenge that may be tackled are the following:
  1. Classify papers by field based on their data/metadata and calculate field weighted indicators
  2. Propose / calculate novel field weighted indicators based on measures other than citation counts
  3. Produce venue-based impact indicators
  4. Propose novel venue-based indicators which are based on paper impact measures other than citation counts, and may avoid shortcomings of existing ones.  
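
As one hedged illustration of aims 3 and 4, the Python sketch below aggregates a per-paper score by venue using both the mean and the median. The median is shown only as an example of an aggregate that is less sensitive to a few highly cited papers; it is not an indicator proposed by the organizers. The keys `venue` and `score` are hypothetical, and `score` may be a citation count or any other per-paper impact measure.

```python
from collections import defaultdict
from statistics import mean, median


def venue_indicators(papers):
    """Return {venue: {"mean": ..., "median": ..., "papers": ...}}."""
    by_venue = defaultdict(list)
    for paper in papers:
        by_venue[paper["venue"]].append(paper["score"])
    return {
        venue: {"mean": mean(scores), "median": median(scores), "papers": len(scores)}
        for venue, scores in by_venue.items()
    }


# Hypothetical data: venue X has one outlier, venue Y is uniformly cited.
sample_papers = (
    [{"venue": "X", "score": s} for s in (1, 1, 1, 97)]
    + [{"venue": "Y", "score": s} for s in (10, 10, 10, 10)]
)

if __name__ == "__main__":
    for venue, stats in venue_indicators(sample_papers).items():
        print(venue, stats)
```

In this toy input venue X has the higher mean because of a single paper, while venue Y has the higher median, which is the kind of discrepancy the criticism of mean-based venue indicators points to.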
The data that will be provided is: 
  • The Crossref publication data (> 100 million papers with metadata)
  • BiP! DB impact scores (https://doi.org/10.5281/zenodo.4474163)
The desired outcome is an open dataset of either paper or venue impact measures, along with the software to produce it. Alternatively, participants may provide a detailed step-by-step roadmap that describes how to produce the data from the given input. This roadmap may include a mixture of pseudocode, code, a selection of tools with instructions on how to use them, diagrams, and whatever else is needed to specify the solution completely and accurately. Using this roadmap, a coder should be in a position to implement the solution without having to make any technical design decisions.
Potential Considerations
[1] Kanellos I, Vergoulis T, Sacharidis D, Dalamagas T, Vassiliou Y. Impact-based ranking of scientific publications: a survey and experimental evaluation. IEEE Transactions on Knowledge and Data Engineering. 2019 Sep 13.
[2] Vergoulis T, Kanellos I, Atzori C, Mannocci A, Chatzopoulos S, Bruzzo SL, Manola N, Manghi P. BiP! DB: A dataset of impact measures for scientific publications. In Companion Proceedings of the Web Conference 2021, 2021 Apr 19 (pp. 456-460).
[3] Lariviere V, Sugimoto CR. The journal impact factor: A brief history, critique, and discussion of adverse effects. In Springer Handbook of Science and Technology Indicators, 2019 (pp. 3-24). Springer, Cham.
[4] https://www.snowballmetrics.com/wp-content/uploads/snowball-metrics-recipe-book-upd.pdf
Recommended technologies and tools
The participants could consider the following tools:
  • Python
  • CrossRef API

This challenge is supported by the projects "Moving from Big Data Management to Data Science" (MIS 5002437/3) and VisualFacts (#1614), funded by the Hellenic Foundation for Research and Innovation - 1st Call of Research Projects for the support of post-doctoral researchers.

 

Challenge #3 - Visual Analytics for Mobility Data
Category
Data visualization, Visual Analytics 
Short Description
The target of this challenge is to analyze the NYC Yellow Taxi Trip Dataset (Taxi) [1] and provide novel interactive visualization methods and visual analytics for the following analytical tasks.
(Sub-)tasks 
Participants are expected to address the following tasks; however, other processing steps and pipelines can be considered as well.
Task 1. Visual Analytics for Traffic Prediction
  • Visualize Taxi records on a geographic map. 
  • Analyze the taxi pick-up data to predict traffic for specific "areas", given a future day and time. 
    • Areas can be split into tiles using the "H3 Uber Grid Splitting" method [2][3], or your preferred approach.
  • Propose and implement innovative techniques to visually represent "information" about traffic prediction based on historical taxi data in the specified areas. As a naïve solution, different colors and charts can be used to represent the traffic over the areas and for different features, e.g., day of week, time of day, POIs in the area, taxi company, number of passengers.
Note. The data prediction and visualization are intended for use in interactive applications, so real-time response is required. The focus of this task is on novel visualization methods that allow users to use/run well-known prediction algorithms (e.g., regression) and interact with the predicted results.
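
A minimal Python sketch of the data side of Task 1 follows, under loud assumptions: it uses a plain square latitude/longitude grid as a stand-in for H3 cells, and a historical average per cell, weekday, and hour as a stand-in for a real prediction algorithm. All function names and the sample coordinates are hypothetical. The profile is precomputed once, so each prediction is a single dictionary lookup, which suits the real-time requirement.

```python
import math
from collections import defaultdict
from datetime import datetime


def cell_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to a square grid cell (a stand-in for an H3 cell index)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def build_profile(pickups, cell_deg=0.01):
    """pickups: iterable of (datetime, lat, lon) tuples.

    Returns {(cell, weekday, hour): average pickups per observed date of that weekday}.
    """
    counts = defaultdict(int)
    dates_by_weekday = defaultdict(set)
    for when, lat, lon in pickups:
        counts[(cell_of(lat, lon, cell_deg), when.weekday(), when.hour)] += 1
        dates_by_weekday[when.weekday()].add(when.date())
    return {key: n / len(dates_by_weekday[key[1]]) for key, n in counts.items()}


def predict(profile, lat, lon, when, cell_deg=0.01):
    """Look up the historical average for the cell, weekday, and hour of `when`."""
    return profile.get((cell_of(lat, lon, cell_deg), when.weekday(), when.hour), 0.0)


if __name__ == "__main__":
    # Hypothetical pickups on two consecutive Mondays in the same grid cell.
    history = [
        (datetime(2021, 6, 7, 9, 5), 40.7505, -73.9934),
        (datetime(2021, 6, 7, 9, 40), 40.7509, -73.9931),
        (datetime(2021, 6, 14, 9, 15), 40.7505, -73.9934),
    ]
    profile = build_profile(history)
    print(predict(profile, 40.7505, -73.9934, datetime(2021, 6, 21, 9, 30)))
```

A front end could color each tile by the value returned from `predict` for the day and time the user selects.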
 
Task 2. Visual Analytics for Dirty Data 
Taxi and other mobility companies aggregate data from various sources and end up with dirty/duplicate entries related to trips. A common task in such cases involves an analyst who wishes to analyze the dataset with respect to data quality. The use of visual techniques may reveal information (e.g., correlations, patterns) that is not easily captured by traditional (non-visual) methods. In our case, visually analyzing "information" related to duplicate entries will help the analyst recognize the data patterns, values, or specific attributes where duplicate records appear. Beyond data quality, these insights will enable analysts to improve the effectiveness of their deduplication techniques.
  • Visualize duplicate records on a geographic map.
  • Propose and implement new visual methods to represent duplicate records on a map. As an example, you may connect the duplicate records with lines. See, for instance, the duplicate information presented in the RawVis tool [1].
  • Propose and implement techniques that visually provide "information" and present statistics related to a set of duplicate records, so that a user can gain insight into the characteristics of duplicate records, e.g., which are the most common attributes where duplicate values appear.
Note. The duplicate records will be given to the participants beforehand.
Note. The data prediction and visualization are intended for use in interactive applications, so real-time response is required.
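
The statistics mentioned in Task 2 could start from something like the following Python sketch, which takes groups of duplicate records and reports, for each attribute, the fraction of groups in which all records share the same value. The input shape (a list of groups, each a list of dicts), the function name, and the attribute names in the sample are assumptions for illustration; the actual format of the provided duplicates may differ.

```python
from collections import Counter


def attribute_agreement(groups, attributes):
    """Return {attribute: fraction of duplicate groups whose records all agree on it}."""
    if not groups:
        return {}
    agree = Counter()
    for group in groups:
        for attr in attributes:
            # A group agrees on an attribute when only one distinct value occurs.
            if len({record.get(attr) for record in group}) == 1:
                agree[attr] += 1
    return {attr: agree[attr] / len(groups) for attr in attributes}


# Hypothetical duplicate groups: coordinates differ in precision, other fields mostly match.
sample_groups = [
    [
        {"passenger_count": 1, "trip_distance": 2.5, "pickup_latitude": 40.75},
        {"passenger_count": 1, "trip_distance": 2.5, "pickup_latitude": 40.7501},
    ],
    [
        {"passenger_count": 2, "trip_distance": 1.0, "pickup_latitude": 40.71},
        {"passenger_count": 2, "trip_distance": 1.1, "pickup_latitude": 40.7102},
    ],
]

if __name__ == "__main__":
    fields = ["passenger_count", "trip_distance", "pickup_latitude"]
    for attr, share in attribute_agreement(sample_groups, fields).items():
        print(f"{attr}: {share:.0%} of groups agree")
```

Such per-attribute shares can feed a bar chart next to the map, showing at a glance which attributes tend to match and which tend to differ within duplicate sets.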
Potential Considerations
[1] The NYC Yellow Taxi Trip dataset (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) consists of CSV files containing information about yellow taxi rides in NYC. Each record refers to a specific taxi ride described by several attributes, such as pick-up location, trip distance, payment type, passenger count, tip amount, etc.
An example of Taxi data visualization on a map can be found in the RawVis system (http://rawviz.imis.athena-innovation.gr), which has been implemented to visualize raw data, e.g., CSV files.
[2] https://eng.uber.com/h3/
[3] https://eng.uber.com/visualizing-city-cores-with-h3/ 
[4] The Taxi dataset has been generated by aggregating data from several data sources. As a result, there is a large number of records (rides) that are (almost) the same. For example, two records of the same ride may have different levels of precision in their latitude/longitude values.
Recommended technologies and tools
The participants are encouraged to either
  • use the RawVis API (https://github.com/VisualFacts/RawVis) to implement the required tasks and enrich the existing front-end functionality, or
  • use other front-end technologies and visualization libraries, e.g., D3.js, Leaflet, Highcharts, Chart.js, Apache ECharts.

This challenge is supported by the projects "Moving from Big Data Management to Data Science" (MIS 5002437/3) and VisualFacts (#1614), funded by the Hellenic Foundation for Research and Innovation - 1st Call of Research Projects for the support of post-doctoral researchers.

             

Committees

SCIENTIFIC COMMITTEE Datathon Action Days
  • Theodore Dalamagas, Research Director, IMSI, "Athena" Research Center
  • George Papastefanatos, Principal Researcher, IMSI, "Athena" Research Center
  • Yannis Stavrakas, Research Director, IMSI, "Athena" Research Center

Responsibilities: Definition of Challenges - Problems to Solve - Motto - Participant Profile

SELECTION COMMITTEE Datathon Action Days (TBC)
  • Prof. Yannis Ioannidis, former General Director, "Athena" Research Center
  • Prof. Antonios Deligiannakis, School of ECE, Technical University of Crete
  • Thanasis Vergoulis, Scientific Associate, "Athena" Research Center

Prizes

1st Prize: A three-day trip to Thrace for the whole winning team!

Tour of the "Athena" Research Center facilities in Xanthi and Athens, mentoring for one (1) month

2nd Prize: 1 laptop

Tour of the "Athena" Research Center facilities in Athens, mentoring for one (1) month

3rd Prize: 1 tablet

Tour of the "Athena" Research Center facilities in Athens, mentoring for one (1) month
