Applying Machine Learning Techniques to Big Data in
the Scholarly Domain
Angelo A. Salatino
Knowledge Media Institute, The Open University, UK
@angelosalatino
5th International School on Applied Probability Theory,
Communications Technologies & Data Science (APTCT-2020)
12 Nov 2020
Agenda
What is Scholarly Data?
Computer Science Ontology
How has it been produced?
What can we do with it?
• Topic classification
• Research trends forecast
• Metadata extraction
• Recommendation of books
• Analyse conferences
About me – Angelo Salatino
Research Associate and Associate Lecturer at the Open University
Research Interests: i) new technologies for classifying scientific
papers according to their relevant research topics, and ii) how the
research output of academia fosters innovation in the industry
In the SKM3 team we produce innovative approaches leveraging
large-scale data mining, semantic technologies, machine learning,
and visual analytics to extract meaning from scholarly data and
shed light on research dynamics
angelo.salatino@open.ac.uk https://salatino.org @angelosalatino
Science of Science
“The science of science places the practice of science itself under the
microscope, leading to a quantitative understanding of the genesis of
scientific discovery, creativity, and practice and developing tools and
policies aimed at accelerating scientific progress.”
Fortunato, Santo, et al. "Science of science." Science 359.6379 (2018).
Picture from the cover of Science Vol 361, Issue 6408
The Computer Science Ontology Framework
This solution supports a variety of high-level
tasks:
i. categorising proceedings in digital
libraries
ii. semantically enhancing the metadata
of scientific publications
iii. generating recommendations
iv. producing smart analytics
v. detecting research trends …
Each layer builds on the layers beneath it
Corpus of Research Papers
Klink-2 Algorithm
Computer Science Ontology
CSO Classifier
High-level Applications
Corpus of Research Papers
Research dissemination
Scholarly Data
Improving Editorial Workflow and
Metadata Quality at Springer Nature.
Identifying the research topics that best describe the scope of a scientific publication is a
crucial task for editors, in particular because the quality of these annotations determine how
effectively users are able to discover the right content in online libraries. For this reason,
Springer Nature, the world’s largest academic book publisher, has traditionally entrusted this
task to their most expert editors. These editors manually analyse all new books, possibly
including hundreds of chapters, and produce a list of the most relevant topics. Hence, this
process has traditionally been very expensive, time-consuming, and confined to a few senior
editors. For these reasons, back in 2016 we developed Smart Topic Miner (STM), an ontology-
driven application that assists the Springer Nature editorial team in annotating the volumes of
all books covering conference proceedings in Computer Science. Since then STM has been
regularly used by editors in Germany, China, Brazil, India, and Japan, …
Angelo Salatino
Francesco Osborne
Aliaksandr Birukou
Enrico Motta
The Open University
Springer Nature
The 18th International Semantic Web Conference (ISWC 2019)
Affiliations
Authors
Citations
References
Conference/Journal
Text: Title, Abstract
Keywords
Scholarly data, Bibliographic metadata, Topic classification, Topic detection, …
Big Scholarly Datasets
• Web of Science
• Scopus
• Google Scholar
• Microsoft Academic Graph
• MA-KG, ma-graph.org
• PubMed
• Dimensions
• Semantic Scholar
• DBLP
• Open Academic Graph
• ScholarlyData
• PID Graph
• Open Research Knowledge Graph
• OpenCitations
• OpenAIRE research graph
• Crossref
• Academy/Industry Dynamics KG
Differences between datasets
All these datasets are different
from each other:
• size
• scope
• quality
• mistakes, author disambiguation
• WoS > Scopus > MAG
• index vs. scraping
• comprehensiveness
• integration with other sources
• access to data: license
Picture from Martijn Visser, Nees Jan van Eck, and Ludo Waltman. "Large-scale comparison of bibliographic
data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic." (2020).
“The comparison considers all scientific documents from the
period 2008–2017 covered by these data sources.”
Taxonomies of Science
What is a Taxonomy?
A taxonomy is a categorization or classification according to discrete sets.
Taxonomies are typically organised in a hierarchical structure.
In practice, it’s a tree structure, with a root node at the top: a single
classification that applies to all objects.
Nodes below are more specific classifications applying to subsets of
objects.
The progress of reasoning proceeds from the general to the more specific.
Examples of taxonomies:
• Taxonomy to classify organisms
• Plant taxonomy
• Phylogenetic tree
• Virus classification
• Taxonomies of Science
Why do we need Taxonomies of Science?
Also called Knowledge Organization Systems, they help to organise digital
libraries.
They represent the structure of disciplines by naming all their sub-disciplines
and research topics
Taxonomies of Research Areas
• Mathematics Subject Classification (MSC2010)
• Physics and Astronomy Classification Scheme (PACS)
• JEL Classification System
• Library of Congress Classification (LCC)
• Computing Classification System (CCS)
Problem with state-of-the-art taxonomies
These taxonomies:
• Are manually curated
• Tend to become outdated quickly
• Are coarse-grained
• Have low completeness
They are unable to reflect the complex structure and the depth of a
discipline
The Computer Science Ontology (CSO)
• Ontology of research areas*, automatically generated using the Klink-2**
algorithm on a dataset of 16 million publications, mainly in Computer
Science
• The current version of CSO includes 14K topics and 159K relationships
• Main roots include Computer Science, Linguistics, Mathematics,
Geometry, Semantics and so on.
• Download CSO from https://cso.kmi.open.ac.uk
* Angelo A Salatino, Thiviyan Thanapalasingam, Andrea Mannocci, Francesco Osborne, Enrico Motta. "The Computer Science
Ontology: A Large-Scale Taxonomy of Research Areas." In ISWC 2018, Monterey, CA (USA).
** Francesco Osborne, and Enrico Motta. "Klink-2: integrating multiple web sources to generate semantic topic networks." In
ISWC 2015, Bethlehem, PA (USA).
Why is CSO an Ontology and not a Taxonomy? Differences
An ontology is a formal description of
knowledge as a set of concepts within a
domain and the relationships that hold
between them
Taxonomies identify hierarchical
relationships within a category
Ontologies take taxonomies a step further
by providing richer information,
including information about the
relationships between entities.
[Figure: example knowledge graph]
Entities: cristiano_ronaldo, zinedine_zidane, juventus, real_madrid
Relations: was_born, has_name, current_team, former_team, jersey_number, current_manager_of
Literals: “Cristiano Ronaldo”, “1985”, “7”, “Juventus F.C.”, “Real Madrid C.F.”, “Zinedine Zidane”, “1972”
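The example graph can be written down as subject, predicate, object triples. The following is a minimal Python sketch, not CSO code. The entity and relation names come from the figure, but which node or literal each edge reaches is an assumption, because the extracted figure does not preserve the arrows. In particular, linking zinedine_zidane to juventus through former_team is a guess.

```python
# Knowledge graph as (subject, predicate, object) triples.
# Edge targets are assumptions reconstructed from the figure labels.
TRIPLES = [
    ("cristiano_ronaldo", "has_name", "Cristiano Ronaldo"),
    ("cristiano_ronaldo", "was_born", "1985"),
    ("cristiano_ronaldo", "jersey_number", "7"),
    ("cristiano_ronaldo", "current_team", "juventus"),
    ("cristiano_ronaldo", "former_team", "real_madrid"),
    ("zinedine_zidane", "has_name", "Zinedine Zidane"),
    ("zinedine_zidane", "was_born", "1972"),
    ("zinedine_zidane", "former_team", "juventus"),  # assumption
    ("zinedine_zidane", "current_manager_of", "real_madrid"),
    ("juventus", "has_name", "Juventus F.C."),
    ("real_madrid", "has_name", "Real Madrid C.F."),
]


def objects(subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """Return every subject linked to `obj` through `predicate`."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


# Typed relations allow questions a pure hierarchy cannot answer.
ronaldo_team = objects("cristiano_ronaldo", "current_team")
juventus_alumni = subjects("former_team", "juventus")
```

A taxonomy would only hold "is a kind of" links. Here every edge carries its own relation type, which is what makes the structure an ontology-style graph.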
Data Model of the Computer Science Ontology
The CSO data model includes eight semantic relations:
• superTopicOf, which indicates that a topic is a super-area of another one (e.g., Semantic Web is a super-topic of Linked Data).
• relatedEquivalent, which indicates that two topics can be treated as equivalent for the purpose of exploring research data (e.g., Ontology Matching, Ontology Alignment).
• contributesTo, which indicates that the research outputs of one topic contribute to another. For instance, research in Ontology Engineering contributes to the Semantic Web, but arguably Ontology Engineering is not a sub-area of the Semantic Web; that is, there is plenty of research in Ontology Engineering outside the Semantic Web area.
• owl:sameAs, which indicates that a research concept is identical to an external resource. We used DBpedia Spotlight to connect research concepts to DBpedia.
• primaryLabel, which states the main label for topics belonging to a cluster of relatedEquivalent topics. For instance, the topics Ontology Matching and Ontology Alignment will both have their primaryLabel set to Ontology Matching.
• rdf:type, which states that a resource is an instance of a class. For example, a resource in our ontology is an instance of topic.
• rdfs:label, which provides a human-readable version of a resource’s name.
• schema:relatedLink, which links CSO concepts to related web pages that either describe the research topics (Wikipedia articles) or provide additional information about the research domains (Microsoft Academic).
Computer Science Ontology
Very fine grained, organised in 13 levels
Spans from general areas:
• Computer Science
• Artificial Intelligence
• Human Computer Interaction
• Software Engineering …
To specific areas:
• Deep Belief Networks
• Dynamic Bayesian Networks
• Neuro-fuzzy Controller …
Klink-2 Algorithm
Klink-2 is an approach for learning large-scale
ontologies of research topics from corpora of
scientific articles and knowledge sources on
the web.
Given a pair of keywords it infers their
semantic relationship:
• superTopicOf
• contributesTo
• relatedEquivalent
Picture from Francesco Osborne, and Enrico Motta. "Klink-2: integrating multiple web sources to
generate semantic topic networks." In ISWC 2015, Bethlehem, PA (USA).
relatedEquivalent
skos:broaderGeneric
contributesTo
Klink-2 Algorithm
Given two topics x and y:
• hierarchical relationship (superTopicOf, contributesTo)
• relatedEquivalent relationship
Terms used in the formulas:
• IR(x, y) is the number of papers associated with both x and y
• cR(x, y) measures how similar the distributions of topics co-occurring with x and with y are
• n(x, y) is the string similarity between the two topics, based on the normalised Levenshtein distance
• super = super topics
• sib = siblings
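To make two of these terms concrete, the sketch below counts how many papers two topics share and compares the topics they co-occur with. It does not reproduce the Klink-2 formulas. The toy corpus, the function names and the use of cosine similarity for the co-occurrence comparison are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: each paper is the set of keywords attached to it.
PAPERS = [
    {"semantic web", "linked data", "rdf"},
    {"semantic web", "linked data", "sparql"},
    {"semantic web", "ontology matching", "rdf"},
    {"semantic web", "ontology alignment", "rdf"},
    {"machine learning", "neural networks"},
]

# co[x][y] = number of papers associated with both x and y
# (co[x][x] is simply the number of papers associated with x).
co = {}
for paper in PAPERS:
    for x in paper:
        row = co.setdefault(x, Counter())
        for y in paper:
            row[y] += 1


def papers_with_both(x, y):
    """Number of papers associated with both topics."""
    return co.get(x, Counter())[y]


def cooccurrence_cosine(x, y):
    """Cosine similarity of the co-occurrence profiles of x and y.

    The two topics themselves are left out of the profiles (an assumption).
    """
    keys = (set(co[x]) | set(co[y])) - {x, y}
    dot = sum(co[x][k] * co[y][k] for k in keys)
    norm_x = sqrt(sum(co[x][k] ** 2 for k in keys))
    norm_y = sqrt(sum(co[y][k] ** 2 for k in keys))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


shared = papers_with_both("semantic web", "linked data")
profile_similarity = cooccurrence_cosine("ontology matching", "ontology alignment")
```

In this toy corpus every "linked data" paper is also a "semantic web" paper but not the other way round, which is the kind of asymmetry a hierarchical relation can build on. "Ontology matching" and "ontology alignment" never co-occur directly but share the same neighbours, which hints at equivalence.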
Topic Classification
Aims at identifying the relevant subjects of a set of documents.
In the scholarly domain: identifying research topics within scientific
articles.
State of the art:
• Topic Models (e.g. LDA)
• Machine Learning
• Citation Networks
• Natural Language Processing (CSO Classifier)
Topic Models
Latent Dirichlet Allocation* and many of its
derivatives
Represent each document as a mixture of
topics. Each topic is a multinomial distribution
over words: a discrete probability distribution
defining the likelihood that each word will
appear in that topic
You need to set some hyperparameters and
pre-define the number of topics
* Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (2003). "Latent Dirichlet Allocation". Journal
of Machine Learning Research. 3 (4–5): pp. 993–1022.
Picture from Kim, Taewoo et al. (2019). Insider Threat Detection Based on User
Behavior Modeling and Anomaly Detection Algorithms. Applied Sciences. 9.
4018.
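The "mixture of topics" idea is easiest to see through the generative story behind LDA. The sketch below follows that story for a single document and does no model fitting. The vocabulary, the three topic distributions and the value of alpha are hypothetical. Fitting LDA amounts to running this process in reverse, inferring topics and mixtures from observed words.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical topics: each one is a probability distribution over the vocabulary.
VOCAB = ["ontology", "rdf", "neural", "training", "citation", "journal"]
TOPICS = [
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],  # a "semantic web"-like topic
    [0.02, 0.02, 0.46, 0.46, 0.02, 0.02],  # a "machine learning"-like topic
    [0.03, 0.03, 0.02, 0.02, 0.45, 0.45],  # a "bibliometrics"-like topic
]


def generate_document(n_words, alpha, rng):
    """Generate one document: draw a topic mixture, then draw each word."""
    theta = sample_dirichlet(alpha, rng)  # the document's mixture of topics
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(TOPICS)), weights=theta)[0]  # pick a topic
        word = rng.choices(VOCAB, weights=TOPICS[topic])[0]        # pick a word from it
        words.append(word)
    return theta, words


rng = random.Random(0)
theta, words = generate_document(n_words=20, alpha=[0.5, 0.5, 0.5], rng=rng)
```

The length of `alpha` is the pre-defined number of topics, and its values are one of the hyperparameters mentioned on the slide.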
Topic Models - Test
Research Paper
Topic 1 Topic 2 Topic 3 Topic 4
0.11504013 0.10714792 0.09840762 0.060187772
Machine Learning
Supervised approach: multiclass model (a class for each topic) able to
classify single subjects within documents.
Machine Learning
Challenges: it requires a labelled corpus with
many instances, balanced across classes
(subjects).
You can mitigate these challenges by working
with fewer classes at a coarser level of granularity
Picture from Radha, Suja & Bandaru, Rama Krishna Rao. (2011). Taxonomy construction
techniques - issues and challenges. Civil Eng.. 2.
Citation Networks
Standing on the shoulders of giants:
• Science progresses by building on the work of
previous scientists and we credit previous work
through references
• Assumption: we tend to cite works from the same
field
• This information (papers + citations) is organised
in a network structure
• Topics, or scientific areas, or “hot fields” are
identified by clustering such network
Citation Networks in practice
[Figure: a citation network of Papers A to M arranged along a time axis]
Paper A cites:
• Paper C
• Paper D
• Paper G
Node (also called vertex, item, object): a paper
Edge (also called link, tie, arc, relation): a citation (“cites”)
Citation Networks - Clustering
• Clusters are cohesive groups of nodes.
• A clustering (community detection) algorithm looks at the topology of the
network and identifies areas in which nodes are more densely connected among
themselves than to the rest of the network.
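As an illustration of community detection on a citation network, here is a small label-propagation sketch in plain Python. Only "Paper A cites C, D and G" comes from the slides. The other edges are hypothetical, and the deterministic tie-break is a simplification, since real implementations usually visit nodes in random order and break ties at random.

```python
from collections import Counter

# (citing, cited) pairs; everything except A -> C, D, G is hypothetical.
CITATIONS = [
    ("A", "C"), ("A", "D"), ("A", "G"),
    ("C", "D"), ("C", "G"), ("D", "G"),
    ("B", "E"), ("B", "F"), ("B", "H"),
    ("E", "F"), ("E", "H"), ("F", "H"),
    ("B", "G"),  # a single citation bridging the two groups
]


def label_propagation(edges, max_rounds=20):
    """Cluster nodes by repeatedly adopting the most common neighbour label."""
    neighbours = {}
    for a, b in edges:  # citations are treated as undirected links
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    labels = {node: node for node in neighbours}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            counts = Counter(labels[n] for n in neighbours[node])
            best = max(counts.values())
            # Deterministic tie-break: smallest label among the most frequent.
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break

    clusters = {}
    for node, label in labels.items():
        clusters.setdefault(label, set()).add(node)
    return sorted(sorted(c) for c in clusters.values())


clusters = label_propagation(CITATIONS)
```

The two densely connected groups end up in separate clusters even though one citation links them.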
Citation Networks – Clustering Algorithms
• Edge Betweenness
• Fast Greedy (greedy optimization of modularity)
• Infomap
• Label propagation
• Leading eigenvector (eigenvector of the community matrix)
• Louvain (multilevel optimization of modularity)
• Leiden (local optimization of modularity)
• Spinglass (statistical mechanics)
• Walktrap (short random walks)
• Clique percolation method
• … and many others
Citation Networks - CitNetExplorer
• Papers arranged on a timeline
• Each colour represents a cluster
• Identify the characteristic terms for
each cluster
Picture from Van Eck, Nees Jan, and Ludo Waltman. "Citation-based clustering of
publications using CitNetExplorer and VOSviewer." Scientometrics 111.2 (2017): 1053-1070.
Citation Networks - VOSViewer
• Each colour represents a co-
citation cluster which defines a
scientific area
Picture from https://www.tudelft.nl/library/actuele-themas/research-analytics/case-12-citation-
networks-2/
Citation Network
• Widely adopted in literature
Problems:
• Given a cluster, one still needs to identify its characteristic topic
• Research papers defining new topics might need some years to gather
citations
• As a result, this approach is ineffective for the early detection of new research topics
Natural Language Processing - CSO Classifier
Uses state-of-the-art technologies to parse documents and recognise
research entities. As input, it takes the metadata associated with a
research paper (title, abstract, keywords) and returns a selection of
research concepts drawn from the Computer Science Ontology
Salatino, Angelo A., et al. "The CSO classifier:
Ontology-driven detection of research topics in
scholarly articles." International Conference on Theory
and Practice of Digital Libraries. Springer, Cham, 2019.
Syntactic Module
• We split the text into unigrams, bigrams and trigrams
• For each n-gram we measure the Levenshtein similarity with the topics
in CSO
• We select the CSO topics whose similarity with an n-gram is greater than
or equal to 0.94
• This helps handle plurals and hyphenated topics, such as:
• “knowledge based systems” and “knowledge-based systems”
• “database” and “databases”
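The steps above can be sketched in a few lines of Python. The slide does not say how the Levenshtein similarity is normalised. This sketch assumes a ratio-style normalisation in which a substitution costs 2 and the distance is divided by the sum of the two lengths, which lets both examples on the slide pass the 0.94 threshold. The topic list and function names are hypothetical.

```python
def ngrams(tokens, max_n=3):
    """Yield all unigrams, bigrams and trigrams as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def similarity(a, b):
    """Normalised Levenshtein similarity in [0, 1].

    Assumption: insertions and deletions cost 1, substitutions cost 2,
    and the distance is normalised by len(a) + len(b).
    """
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 2
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    total = len(a) + len(b)
    return (total - prev[-1]) / total


def syntactic_match(text, topics, threshold=0.94):
    """Return the topics that are similar enough to some n-gram of the text."""
    tokens = text.lower().split()
    found = set()
    for gram in ngrams(tokens):
        for topic in topics:
            if similarity(gram, topic) >= threshold:
                found.add(topic)
    return found


# Hypothetical excerpt of ontology topics.
TOPICS = ["knowledge-based systems", "database", "semantic web"]
matched = syntactic_match("Knowledge based systems over distributed databases", TOPICS)
```

Comparing every n-gram with every topic is quadratic. With an ontology of thousands of topics an index or a pre-filter would be needed, which this sketch leaves out.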
Semantic Module
Word Embedding model
• We used titles and abstracts from 4.5M papers in Computer Science
• Pre-processed text:
• Topic replacement – “digital libraries” → “digital_libraries”
• Collocation analysis – “highest_accuracies”, “highly_cited_journals”
• Trained word2vec model (method: skipgram; embedding size: 128; window size: 10; negative: 5; max iterations: 5; min-count cutoff: 10)
Word Embedding model
“king” = [0.32, 0.76,…]
“queen” = [0.42, 0.76,…]
“woman” = [0.56, 0.43,…]
“man” = [0.59, 0.42,...]
king + (woman – man) = queen
It locates synonyms
(related topics) close to
each other in this vector
space: high cosine
similarity
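The analogy and the notion of "close in vector space" can be checked with a few lines of Python. The 3-dimensional vectors below are hypothetical toy values chosen for illustration. They are not the truncated vectors shown on the slide, and real word2vec embeddings would have 128 dimensions in this setup.

```python
from math import sqrt

# Hypothetical toy embeddings (not taken from any trained model).
EMB = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.2],
    "apple":  [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def analogy(a, b, c):
    """Return the word closest to a + (b - c), excluding the three query words."""
    target = [x + y - z for x, y, z in zip(EMB[a], EMB[b], EMB[c])]
    candidates = [w for w in EMB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMB[w]))


result = analogy("king", "woman", "man")  # king + (woman - man)
```

The same `cosine` function is what "high cosine similarity" refers to when related topics sit close to each other in the embedding space.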
Semantic Module
Entity Extraction
• POS tagger, and grammar-based chunk parser <JJ.*>*<NN.*>+
“digital libraries”
CSO concept identification
• Selects all CSO topics found in the top-10 similar words of the resulting
n-grams (with cosine similarity > 0.7)
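The chunk grammar `<JJ.*>*<NN.*>+` reads as "zero or more adjectives followed by one or more nouns". The sketch below applies that pattern to tokens that are already POS-tagged. The tags in the example are hypothetical, and the hand-written scanner is a stand-in for a proper grammar-based chunk parser.

```python
def chunk(tagged):
    """Extract chunks matching <JJ.*>*<NN.*>+ from (word, tag) pairs."""
    chunks, adjectives, nouns = [], [], []

    def flush():
        # A chunk is only valid if it contains at least one noun.
        if nouns:
            chunks.append(" ".join(adjectives + nouns))
        adjectives.clear()
        nouns.clear()

    for word, tag in tagged:
        if tag.startswith("NN"):
            nouns.append(word)
        elif tag.startswith("JJ"):
            if nouns:  # an adjective after nouns starts a new chunk
                flush()
            adjectives.append(word)
        else:
            flush()
    flush()
    return chunks


# Hypothetical POS-tagged sentence (Penn Treebank style tags).
TAGGED = [
    ("We", "PRP"), ("study", "VBP"),
    ("digital", "JJ"), ("libraries", "NNS"),
    ("and", "CC"),
    ("semantic", "JJ"), ("web", "NN"), ("technologies", "NNS"),
]
entities = chunk(TAGGED)
```

Each extracted chunk would then be split into n-grams and looked up in the embedding model, as described in the concept identification step.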
Semantic Module
Concept ranking
• We assign a score to each identified topic:
• Frequency – number of times it was inferred
• Diversity – number of unique text chunks from which it was inferred
Concept Selection
• Elbow method
CSO Topic                  Score
domain ontologies          40
semantic web               40
ontology learning          40
data mining                40
heterogeneous resources    24
semantics                  24
world wide web             10
network architecture       6
scholarly communication    6
ontology matching          6
…                          …
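A minimal sketch of the ranking and selection steps follows. Two things are assumptions here: the score is taken to be frequency times diversity, since the slide names the two factors but not how they are combined, and the elbow method is replaced by a cruder largest-gap cut-off. The inference log is hypothetical, while `SLIDE_SCORES` reuses the scores from the table above.

```python
def score(chunks):
    """Score a topic from the list of text chunks it was inferred from."""
    frequency = len(chunks)       # number of times the topic was inferred
    diversity = len(set(chunks))  # number of unique chunks that led to it
    return frequency * diversity  # assumption: the combination is a product


def select(scores):
    """Keep the topics ranked before the largest drop in score.

    This is a simple stand-in for the elbow method, not the method itself.
    """
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    if len(ranked) < 2:
        return [topic for topic, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [topic for topic, _ in ranked[:cut]]


# Hypothetical inference log: topic -> chunks from which it was inferred.
INFERRED = {
    "semantic web": ["semantic web", "linked data", "web ontology"],
    "ontology learning": ["ontology learning", "ontology learning",
                          "learning ontologies", "learning ontologies"],
    "data mining": ["data mining", "data mining", "text mining"],
    "world wide web": ["web"],
    "network architecture": ["architecture"],
}
selected = select({topic: score(chunks) for topic, chunks in INFERRED.items()})

# Scores as listed in the table on the slide.
SLIDE_SCORES = {
    "domain ontologies": 40, "semantic web": 40, "ontology learning": 40,
    "data mining": 40, "heterogeneous resources": 24, "semantics": 24,
    "world wide web": 10, "network architecture": 6,
    "scholarly communication": 6, "ontology matching": 6,
}
selected_from_slide = select(SLIDE_SCORES)
```

With the scores from the table, the largest drop is between 40 and 24, so this cut-off keeps the four topics scored 40. A proper elbow detection may place the cut elsewhere.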
Post Processing
Combination of output
Semantic enhancement
• We use the superTopicOf relation to enhance the output set
• E.g., if a paper is tagged with “machine learning”, it is also tagged with “artificial intelligence”
• Provides wider context for the analysed paper
• Enables analytics on high-level abstract topics (e.g., digital libraries)
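Semantic enhancement boils down to walking up the superTopicOf hierarchy. The sketch below uses a hypothetical fragment of the hierarchy and adds every ancestor of the input topics. Whether to add only the direct super topics or all ancestors is a design choice that the slide leaves open.

```python
# Hypothetical fragment of the hierarchy: topic -> its direct super topics.
SUPER_TOPICS = {
    "machine learning": ["artificial intelligence"],
    "artificial intelligence": ["computer science"],
    "digital libraries": ["computer science"],
}


def enhance(topics):
    """Return the input topics plus every ancestor reachable via superTopicOf."""
    result = set(topics)
    stack = list(topics)
    while stack:
        for parent in SUPER_TOPICS.get(stack.pop(), []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


enhanced = enhance({"machine learning"})
```

The `parent not in result` check also protects against loops, should a topic ever be reachable through more than one path.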
Metadata Extraction
Metadata Extraction – Springer Nature Use Case
• Traditionally, editors choose a list of related keywords and categories in
relevant taxonomies according to:
• their own experience of similar conferences
• a visual exploration of titles and abstracts
• a list of terms given by the curators or derived from calls for papers
Salatino, A. A., Osborne, F., Birukou, A., & Motta, E. (2019, October). Improving editorial workflow and metadata quality at springer nature. In International Semantic Web
Conference (pp. 507-525). Springer, Cham.
Classification of Proceedings – A Complex Problem
Classifying publications manually presents a number of issues for a large
publisher such as Springer Nature.
• It is a complex process that requires expert editors
• It is a time-consuming process that can hardly scale
• It is easy to miss the emergence of new topics
• It is easy to assume that some traditional topics are still popular when
this is no longer the case
• The keywords used in the call for papers are often a reflection of what a
venue aspires to be, rather than the real contents of the proceedings.
Smart Topic Miner Architecture
Demo of STM: http://stm-demo.kmi.open.ac.uk
[Architecture diagram] Components: SN Editors; HTML GUI; Parser; Generate Visualizations; STM Engine; CSO; SNCs; Historical Data; word2vec model
Listed steps: i) CSO Classifier, ii) Topic Explanation, iii) Taxonomy Generation, iv) SN Tags Inference, v) Previous Classification
Business Value
• STM halves the time needed for classifying proceedings from 30 to 15
minutes
• It also allows junior editors to work on the classification of proceedings,
distributing the load and reducing costs
• It achieved an overall 75% cost reduction
• The adoption of a controlled vocabulary makes the process more robust
and facilitates the identification of related editorial products
Recommendation of Books
Identifying SN Books to be marketed at specific events, such as academic conferences
• Manual book selection has some limitations:
• Requires years of experience and domain-specific knowledge
• Requires browsing through large catalogue of information - syntactic
• Prone to biases
• Aim:
• Provide a more effective way to support the book selection process
• Help drive the operating costs down
• Semi-automated selection of the most appropriate books, journals, and proceedings to
market at a scientific event
• We developed the Smart Book Recommender: http://rexplore.kmi.open.ac.uk/SBR-demo
Thanapalasingam, T., Osborne, F., Birukou, A., & Motta, E. (2018, October). Ontology-based recommendation of editorial products. In International Semantic Web
Conference (pp. 341-358). Springer, Cham.
Workflow (1): Classifying Conferences & Editorial Products
• We characterised Books and Conference Proceedings through their
research topics:
• The metadata of chapters/papers (i.e. keywords, title and
abstract) are mapped to research topics in CSO
• It returns a set of research topics
Steps: Classifying Conferences & Editorial Products → Computing Pairwise Similarity → Querying & Visualizing results
Workflow (2): Computing Pairwise Similarity
• Computing the cosine similarity of two editorial products using vectors of
research topics and their weights
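The pairwise similarity step can be sketched with sparse vectors, where each editorial product is a dictionary mapping research topics to weights. The conference, the two books and all weights below are hypothetical, and the slides do not say how the weights are derived.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {topic: weight}."""
    dot = sum(weight * v[topic] for topic, weight in u.items() if topic in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical topic weights for one conference and two books.
conference = {"semantic web": 0.6, "linked data": 0.3, "machine learning": 0.1}
books = {
    "book_a": {"semantic web": 0.5, "ontology": 0.4, "linked data": 0.1},
    "book_b": {"machine learning": 0.7, "neural networks": 0.3},
}

# Rank books by how similar their topic profile is to the conference's.
ranking = sorted(books, key=lambda b: cosine(conference, books[b]), reverse=True)
```

Storing only non-zero weights keeps the vectors small even when the ontology has thousands of topics.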
Workflow (3): Querying & Visualizing results
• A web application makes AJAX requests to query a relational
database and displays the recommendations
• Visualisations are constructed in real time using D3.js
• Interactive analytics for comparing items
Advanced Visualisation of a Book's Topics
Research Trends Forecast
• We created a new approach for predicting the impact of a topic on
industry.
• It uses four time series: i) publications from academia (RA), ii) publications
from industry (RI), iii) patents from academia (PA), and iv) patents from industry (PI).
• We tested it on the task of predicting whether an emergent research topic will
have a significant impact on industry (> 50 patents) in the following 10
years.
• This evaluation substantiates the hypothesis that considering the four
time series separately is conducive to higher-quality predictions and
suggests that RI and RA are good indicators for PI.
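The prediction task can be framed as building one labelled example per topic. The sketch below turns four yearly time series into a feature vector and a binary label. It is an illustration under assumptions: the counts are hypothetical, the number of observed years is arbitrary, and counting only patents from industry towards the 50-patent threshold is a guess, since the slide does not say which patents are counted.

```python
def make_example(series, observed_years, horizon=10, threshold=50):
    """Build (features, label) for one topic from its four yearly time series.

    features: the first `observed_years` values of each series, concatenated.
    label: True if the topic gets more than `threshold` industry patents
           in the `horizon` years after the observation window (assumption).
    """
    features = []
    for name in ("RA", "RI", "PA", "PI"):
        features.extend(series[name][:observed_years])
    future_patents = sum(series["PI"][observed_years:observed_years + horizon])
    return features, future_patents > threshold


# Hypothetical yearly counts for one emerging topic (15 years in total).
topic = {
    "RA": [2, 5, 9, 14, 20] + [0] * 10,   # papers from academia
    "RI": [0, 1, 3, 6, 10] + [0] * 10,    # papers from industry
    "PA": [0, 0, 0, 1, 1] + [0] * 10,     # patents from academia
    "PI": [0, 0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],  # patents from industry
}
features, label = make_example(topic, observed_years=5)
```

Examples built this way could then be fed to any binary classifier, and restricting `features` to a subset of the four series is how different combinations of time series can be compared.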
Data modelling pipeline
[Pipeline diagram] Inputs: Research Papers, Patents. Stages shown: Filtering documents (for each input), CSO Classifier, Extraction of affiliation types. The CSO Classifier relies on the Computer Science Ontology to produce a fine-grained representation of research topics.
Peoples' Friendship University of Russia
Salatino, A., Osborne, F., & Motta, E. (2020, September).
Researchflow: Understanding the knowledge flow between
academia and industry. In International Conference on
Knowledge Engineering and Knowledge Management (pp.
219-236). Springer, Cham.
Scholarly Data++
Affiliations
Authors
Citations
References
Conference/Journal
Text: Title, Abstract, Keywords
scholarly data, semantic web, data mining, ontology, digital libraries, …
Topics
Affiliation Types
Academia
Industry
Keywords
Scholarly data, Bibliographic metadata, Topic classification,
Research Topic
Each research topic is represented through 4 signals:
Papers from Academia (RA)
Papers from Industry (RI)
Patents from Academia (PA)
Patents from Industry (PI)
Machine Learning approach
We used:
• Logistic Regression (LR)
• Random Forest (RF)
• AdaBoost (AB)
• Convolutional Neural Network (CNN)
• Long Short-term Memory Neural Network (LSTM)
On several combinations of time-series: RA, RI, PA and PI
Forecasting Topic Impact on Industry
Analysing Conferences
Conference Dashboard
Angioni, Simone, et al. "The AIDA Dashboard: Analysing Conferences with Semantic Technologies."
Conclusions
• We have seen how effective the Computer Science Ontology Framework is
• It enables us to combine machine learning algorithms and semantic technologies to produce high-level applications for gaining insights in the field of Science of Science
• Future work: applying this framework in other domains of Science
Scholarly Knowledge Modelling, Mining and SenseMaking
Francesco Osborne, Angelo Salatino, Simone Angioni, Enrico Motta, Danilo Dessì

More Related Content

What's hot

End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text Paul Groth
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-cziPaul Groth
 
Project Proposal Topics Modeling (Ir)
Project Proposal    Topics Modeling (Ir)Project Proposal    Topics Modeling (Ir)
Project Proposal Topics Modeling (Ir)Svitlana volkova
 
Knowledge Management Cultures: A Comparison of Engineering and Cultural Scien...
Knowledge Management Cultures: A Comparison of Engineering and Cultural Scien...Knowledge Management Cultures: A Comparison of Engineering and Cultural Scien...
Knowledge Management Cultures: A Comparison of Engineering and Cultural Scien...Ralf Klamma
 
Oop principles a good book
Oop principles a good bookOop principles a good book
Oop principles a good booklahorisher
 
Data and Knowledge as Commodities
Data and Knowledge as CommoditiesData and Knowledge as Commodities
Data and Knowledge as CommoditiesMathieu d'Aquin
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Lucy McKenna
 
Data and Software in Scientific Activities: a Literature Review
Data and Software in Scientific Activities: a Literature ReviewData and Software in Scientific Activities: a Literature Review
Data and Software in Scientific Activities: a Literature ReviewKai Li
 
Automatic and unsupervised topic discovery in social networks
Automatic and unsupervised topic discovery in social networksAutomatic and unsupervised topic discovery in social networks
Automatic and unsupervised topic discovery in social networksAntonio Moreno
 
Contractor-Borner-SNA-SAC
Contractor-Borner-SNA-SACContractor-Borner-SNA-SAC
Contractor-Borner-SNA-SACwebuploader
 
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...ijaia
 
Harnessing User Library Statistics for Research Evaluation and Knowledge Doma...
Harnessing User Library Statistics for Research Evaluation and Knowledge Doma...Harnessing User Library Statistics for Research Evaluation and Knowledge Doma...
Harnessing User Library Statistics for Research Evaluation and Knowledge Doma...Open Knowledge Maps
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentThe Digital Group
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chainPaul Groth
 

What's hot (20)

Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
Project Proposal Topics Modeling (Ir)
Project Proposal    Topics Modeling (Ir)Project Proposal    Topics Modeling (Ir)
Project Proposal Topics Modeling (Ir)
 
Ir 01
Ir   01Ir   01
Ir 01
 
Knowledge Management Cultures: A Comparison of Engineering and Cultural Scien...
Knowledge Management Cultures: A Comparison of Engineering and Cultural Scien...Knowledge Management Cultures: A Comparison of Engineering and Cultural Scien...
Knowledge Management Cultures: A Comparison of Engineering and Cultural Scien...
 
Oop principles a good book
Oop principles a good bookOop principles a good book
Oop principles a good book
 
Data and Knowledge as Commodities
Data and Knowledge as CommoditiesData and Knowledge as Commodities
Data and Knowledge as Commodities
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...
 
Data and Software in Scientific Activities: a Literature Review
Data and Software in Scientific Activities: a Literature ReviewData and Software in Scientific Activities: a Literature Review
Data and Software in Scientific Activities: a Literature Review
 
Automatic and unsupervised topic discovery in social networks
Automatic and unsupervised topic discovery in social networksAutomatic and unsupervised topic discovery in social networks
Automatic and unsupervised topic discovery in social networks
 
Contractor-Borner-SNA-SAC
Contractor-Borner-SNA-SACContractor-Borner-SNA-SAC
Contractor-Borner-SNA-SAC
 
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
 
CV
CVCV
CV
 
Data collection, Data Integration, Data Understanding e Data Cleaning & Prepa...
Data collection, Data Integration, Data Understanding e Data Cleaning & Prepa...Data collection, Data Integration, Data Understanding e Data Cleaning & Prepa...
Data collection, Data Integration, Data Understanding e Data Cleaning & Prepa...
 
Harnessing User Library Statistics for Research Evaluation and Knowledge Doma...
Harnessing User Library Statistics for Research Evaluation and Knowledge Doma...Harnessing User Library Statistics for Research Evaluation and Knowledge Doma...
Harnessing User Library Statistics for Research Evaluation and Knowledge Doma...
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
 
What is What, When?
What is What, When?What is What, When?
What is What, When?
 
Data and Research Infrastructures and Open Science
Data and Research Infrastructures and Open ScienceData and Research Infrastructures and Open Science
Data and Research Infrastructures and Open Science
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chain
 

Applying machine learning techniques to big data in the scholarly domain

  • 1. Applying Machine Learning Techniques to Big Data in the Scholarly Domain Angelo A. Salatino Knowledge Media Institute, The Open University, UK @angelosalatino 5th International School on Applied Probability Theory, Communications Technologies & Data Science (APTCT-2020) 12 Nov 2020
  • 2. Agenda What is Scholarly Data? Computer Science Ontology How has it been produced? What can we do with it? • Topic classification • Research trends forecast • Metadata extraction • Recommendation of books • Analyse conferences
  • 3. About me – Angelo Salatino Research Associate and Associate Lecturer at the Open University Research Interests: i) new technologies for classifying scientific papers according to their relevant research topics, and ii) how the research output of academia fosters innovation in the industry At the SKM3 team we produce innovative approaches leveraging large-scale data mining, semantic technologies, machine learning, and visual analytics to extract meaning from scholarly data and shed light on the research dynamic angelo.salatino@open.ac.uk https://salatino.org @angelosalatino
  • 4. Science of Science “The science of science places the practice of science itself under the microscope, leading to a quantitative understanding of the genesis of scientific discovery, creativity, and practice and developing tools and policies aimed at accelerating scientific progress.” Fortunato, Santo, et al. "Science of science." Science 359.6379 (2018). Picture from the cover of Science Vol 361, Issue 6408
  • 5. The Computer Science Ontology Framework This solution supports a variety of high-level tasks: i. categorising proceedings in digital libraries ii. semantically enhancing the metadata of scientific publications iii. generating recommendations iv. producing smart analytics v. detecting research trends … Each layer exploits the layers beneath it: Corpus of Research Papers, Klink-2 Algorithm, Computer Science Ontology, CSO Classifier, High-level Applications
  • 8. Scholarly Data Improving Editorial Workflow and Metadata Quality at Springer Nature. Identifying the research topics that best describe the scope of a scientific publication is a crucial task for editors, in particular because the quality of these annotations determines how effectively users are able to discover the right content in online libraries. For this reason, Springer Nature, the world’s largest academic book publisher, has traditionally entrusted this task to their most expert editors. These editors manually analyse all new books, possibly including hundreds of chapters, and produce a list of the most relevant topics. Hence, this process has traditionally been very expensive, time-consuming, and confined to a few senior editors. For these reasons, back in 2016 we developed Smart Topic Miner (STM), an ontology-driven application that assists the Springer Nature editorial team in annotating the volumes of all books covering conference proceedings in Computer Science. Since then STM has been regularly used by editors in Germany, China, Brazil, India, and Japan, … Angelo Salatino Francesco Osborne Aliaksandr Birukou Enrico Motta The Open University Springer Nature The 18th International Semantic Web Conference (ISWC 2019) Affiliations Authors Citations References Conference/Journal Text: Title, Abstract Keywords Scholarly data, Bibliographic metadata, Topic classification, Topic detection, …
  • 9. Big Scholarly Datasets • Web of Science • Scopus • Google Scholar • Microsoft Academic Graph • MA-KG, ma-graph.org • PubMed • Dimensions • Semantic Scholar • DBLP • Open Academic Graph • ScholarlyData • PID Graph • Open Research Knowledge Graph • OpenCitations • OpenAIRE research graph • Crossref • Academy/Industry Dynamics KG
  • 10. Differences between datasets All these datasets are different from each other: • size • scope • quality • mistakes, author disambiguation • WoS > Scopus > MAG • index vs. scraping • comprehensiveness • integration with other sources • access to data: license Picture from Martijn Visser, Nees Jan van Eck, and Ludo Waltman. "Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic." (2020). “The comparison considers all scientific documents from the period 2008–2017 covered by these data sources.”
  • 12. What is a Taxonomy? A taxonomy is a categorization or classification according to discrete sets. Taxonomies are typically organised in a hierarchical structure. In practice, it’s a tree structure, with a root node at the top: a single classification that applies to all objects. Nodes below are more specific classifications applying to subsets of objects. Reasoning proceeds from the general to the more specific. Examples of taxonomies: • Taxonomy to classify organisms • Plant taxonomy • Phylogenetic tree • Virus classification • Taxonomies of Science
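The tree structure described on this slide can be sketched in a few lines of Python. This is only an illustration: the `TaxonomyNode` class, the `path_to` helper and the labels in the toy tree are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaxonomyNode:
    """One classification; its children are more specific classifications."""
    label: str
    children: List["TaxonomyNode"] = field(default_factory=list)

    def add(self, label: str) -> "TaxonomyNode":
        """Attach a more specific classification and return it."""
        child = TaxonomyNode(label)
        self.children.append(child)
        return child


def path_to(node: TaxonomyNode, target: str) -> Optional[List[str]]:
    """Return the labels from `node` down to `target`, general to specific."""
    if node.label == target:
        return [node.label]
    for child in node.children:
        sub_path = path_to(child, target)
        if sub_path is not None:
            return [node.label] + sub_path
    return None  # target is not in this subtree


# Hypothetical fragment of a taxonomy of science (labels are illustrative).
root = TaxonomyNode("Science")
cs = root.add("Computer Science")
ai = cs.add("Artificial Intelligence")
ai.add("Machine Learning")
root.add("Physics")

print(path_to(root, "Machine Learning"))
```

Walking from the root to a node reproduces the general-to-specific reasoning that a taxonomy encodes.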
  • 13. Why do we need Taxonomies of Science? Also called Knowledge Organization Systems, they help to organise digital libraries. They represent the structure of disciplines by naming all their sub-disciplines and research topics
  • 14. Taxonomies of Research Areas Mathematics Subject Classification – MSC2010 Physics and Astronomy Classification Scheme (PACS) JEL Classification System Library of Congress Classification (LCC) Computing Classification System (CCS)
  • 15. Problems with state-of-the-art taxonomies These taxonomies are: • Manually curated • Quick to become outdated • Coarse-grained • Incomplete They are unable to reflect the complex structure and the depth of a discipline
  • 16. The Computer Science Ontology (CSO) • Ontology of research areas*, automatically generated using the Klink-2** algorithm on a dataset of 16 million publications, mainly in Computer Science • The current version of CSO includes 14K topics and 159K relationships • Main roots include Computer Science, Linguistics, Mathematics, Geometry, Semantics and so on. • Download CSO from https://cso.kmi.open.ac.uk * Angelo A Salatino, Thiviyan Thanapalasingam, Andrea Mannocci, Francesco Osborne, Enrico Motta. "The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas." In ISWC 2018, Monterey, CA (USA). ** Francesco Osborne, and Enrico Motta. "Klink-2: integrating multiple web sources to generate semantic topic networks." In ISWC 2015, Bethlehem, PA (USA).
  • 17. Why is CSO an Ontology and not a Taxonomy? Differences: An ontology is a formal description of knowledge as a set of concepts within a domain and the relationships that hold between them. Taxonomies identify hierarchical relationships within a category. Ontologies take taxonomy a step further by providing richer information, including information about the relationships between entities. [Figure: example knowledge graph connecting the entities cristiano_ronaldo, juventus, real_madrid and zinedine_zidane through the relations has_name, was_born, jersey_number, current_team, former_team and current_manager_of, with literal values “Cristiano Ronaldo”, “1985”, “7”, “Juventus F.C.”, “Real Madrid C.F.”, “Zinedine Zidane” and “1972”]
  • 18. Data Model of the Computer Science Ontology The CSO data model includes eight semantic relations: • superTopicOf, which indicates that a topic is a super-area of another one (e.g., Semantic Web is a super-topic of Linked Data). • relatedEquivalent, which indicates that two topics can be treated as equivalent for the purpose of exploring research data (e.g., Ontology Matching, Ontology Alignment). • contributesTo, which indicates that the research outputs of one topic contribute to another. For instance, research in Ontology Engineering contributes to the Semantic Web, but arguably Ontology Engineering is not a sub-area of the Semantic Web; that is, there is plenty of research in Ontology Engineering outside the Semantic Web area. • owl:sameAs, which indicates that a research concept is identical to an external resource. We used DBpedia Spotlight to connect research concepts to DBpedia. • primaryLabel, which states the main label for topics belonging to a cluster of relatedEquivalent topics. For instance, the topics Ontology Matching and Ontology Alignment will both have their primaryLabel set to Ontology Matching. • rdf:type, which states that a resource is an instance of a class. For example, a resource in our ontology is an instance of topic. • rdfs:label, which provides a human-readable version of a resource’s name. • schema:relatedLink, which links CSO concepts to related web pages that either describe the research topics (Wikipedia articles) or provide additional information about the research domains (Microsoft Academic).
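To make the data model concrete, the relations can be pictured as (subject, relation, object) triples. The sketch below is a plain-Python toy, not the CSO distribution format or any CSO API: the `TRIPLES` list and the helper functions are hypothetical, the topic names reuse the examples from the slide, and superTopicOf is read as pointing from the broader topic to the narrower one.

```python
from typing import List

# Toy fragment of the data model as (subject, relation, object) triples.
TRIPLES = [
    ("semantic web", "superTopicOf", "linked data"),
    ("ontology engineering", "contributesTo", "semantic web"),
    ("ontology matching", "relatedEquivalent", "ontology alignment"),
    ("ontology alignment", "relatedEquivalent", "ontology matching"),
    ("ontology matching", "primaryLabel", "ontology matching"),
    ("ontology alignment", "primaryLabel", "ontology matching"),
]


def objects(subject: str, relation: str) -> List[str]:
    """All objects o such that (subject, relation, o) is in the toy graph."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation: str, obj: str) -> List[str]:
    """All subjects s such that (s, relation, obj) is in the toy graph."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


def super_topics(topic: str) -> List[str]:
    """Broader areas of `topic`, following superTopicOf backwards."""
    return subjects("superTopicOf", topic)


def primary_label(topic: str) -> str:
    """Main label of the relatedEquivalent cluster, or the topic itself."""
    labels = objects(topic, "primaryLabel")
    return labels[0] if labels else topic


print(super_topics("linked data"))
print(primary_label("ontology alignment"))
```

Collapsing equivalent topics onto their primaryLabel before counting papers is one way such a cluster could be used when exploring research data.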
  • 19. Computer Science Ontology Very fine grained, organised in 13 levels Spans from general areas: • Computer Science • Artificial Intelligence • Human Computer Interaction • Software Engineering … To specific areas: • Deep Belief Networks • Dynamic Bayesian Networks • Neuro-fuzzy Controller …
  • 21. Klink-2 Algorithm Klink-2 is an approach for learning large-scale ontologies of research topics from corpora of scientific articles and knowledge sources on the web. Given a pair of keywords it infers their semantic relationship: • superTopicOf • contributesTo • relatedEquivalent Picture from Francesco Osborne, and Enrico Motta. "Klink-2: integrating multiple web sources to generate semantic topic networks." In ISWC 2015, Bethlehem, PA (USA). [Figure legend: relatedEquivalent, skos:broaderGeneric, contributesTo]
  • 22. Klink-2 Algorithm Given two topics x and y, two kinds of relationship are considered: a hierarchical relationship (superTopicOf, contributesTo) and a relatedEquivalent relationship. Quantities used: • IR(x, y) is the number of papers associated with both x and y • cR(x, y) measures how similar the distributions of the topics co-occurring with x and with y are • n(x, y) is the string similarity between the two topics, computed with the normalised Levenshtein distance. Notation: super = super topics, sib = siblings
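Two of the quantities named on this slide are simple enough to sketch directly. The code below is a minimal illustration, not the Klink-2 implementation: the function names and the toy `paper_topics` data are hypothetical, and turning the Levenshtein distance into a similarity by dividing by the longer string's length and subtracting from 1 is an assumption, since the slide does not spell out the normalisation.

```python
from typing import Dict, Set


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def co_occurrence(paper_topics: Dict[str, Set[str]], x: str, y: str) -> int:
    """Number of papers associated with both topic x and topic y."""
    return sum(1 for topics in paper_topics.values() if x in topics and y in topics)


# Hypothetical toy corpus: paper id -> set of topics.
paper_topics = {
    "p1": {"semantic web", "linked data"},
    "p2": {"semantic web", "ontology matching"},
    "p3": {"semantic web", "linked data", "rdf"},
}

print(co_occurrence(paper_topics, "semantic web", "linked data"))
print(string_similarity("ontology matching", "ontology alignment"))
```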
  • 24. Topic Classification Aims at identifying the relevant subjects of a set of documents. In the scholarly domain: identifying research topics within scientific articles. State of the art: • Topic Models (e.g. LDA) • Machine Learning • Citation Networks • Natural Language Processing (CSO Classifier)
  • 25. Topic Models Latent Dirichlet Allocation* and many of it derivatives Represent each document as a mixture of topics, and a topic is a multinomial distribution over words characterised as a discrete probability distribution defining the likelihood that each word will appear in a given topic You need to set some hyperparameters and pre-define the number of topics * Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (2003). "Latent Dirichlet Allocation". Journal of Machine Learning Research. 3 (4–5): pp. 993–1022. Picture from Kim, Taewoo et al. (2019). Insider Threat Detection Based on User Behavior Modeling and Anomaly Detection Algorithms. Applied Sciences. 9. 4018.
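The description above corresponds to the generative process usually given for LDA. It is written here in the commonly used smoothed formulation, as a reading aid and not as the exact notation of any slide; the symbols are assumptions of this sketch. The quantities that have to be chosen in advance are the Dirichlet hyperparameters α and β and the number of topics K.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta)
  && \text{word distribution of topic } k,\quad k = 1, \dots, K \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic assigned to the } n\text{-th word of } d \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
  && \text{the observed word}
\end{align*}
```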
  • 26. Topic Models - Test Research paper: Topic 1 = 0.11504013, Topic 2 = 0.10714792, Topic 3 = 0.09840762, Topic 4 = 0.060187772
  • 27. Machine Learning Supervised approach: multiclass model (a class for each topic) able to classify single subjects within documents.
  • 28. Machine Learning CHALLENGES: it requires a labelled corpus with many instances, balanced across classes (subjects). You can mitigate these challenges by working with fewer classes at a coarser level of granularity. Picture from Radha, Suja & Bandaru, Rama Krishna Rao. (2011). Taxonomy construction techniques - issues and challenges. Civil Eng.. 2.
  • 29. Citation Networks Standing on the shoulders of giants: • Science progresses by building on the work of previous scientists and we credit previous work through references • Assumption: we tend to cite works from the same field • This information (papers + citations) is organised in a network structure • Topics, or scientific areas, or “hot fields” are identified by clustering such network
  • 30. Citation Networks in practice Papers A to M are placed along a time axis. Paper A cites: • Paper C • Paper D • Paper G Terminology: a node (also vertex, item or object) is a paper; an edge (also link, tie, arc or relation) is a “cites” relationship.
  • 31. Citation Networks - Clustering • Clusters are cohesive groups of nodes. • A clustering (community detection) algorithm looks at the topology of the network and identifies areas in which nodes are more connected among themselves than to the rest of the network.
  • 32. Citation Networks – Clustering Algorithms • Edge Betweenness • Fast Greedy (greedy optimization of modularity) • Infomap • Label propagation • Leading eigenvector (eigenvector of the community matrix) • Louvain (multilevel optimization of modularity) • Leiden (local optimization of modularity) • Spinglass (statistical mechanics) • Walktrap (short random walks) • Clique percolation method • … and many others
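Several of the algorithms listed above optimise modularity, which compares the fraction of edges that fall inside clusters with the fraction expected if edges were placed at random while keeping node degrees. The sketch below computes that score for a given partition. It treats citations as undirected and unweighted (a simplifying assumption), and the toy graph is hypothetical: two triangles of papers joined by a single citation.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges:     list of (u, v) pairs, one per edge
    community: dict mapping each node to its cluster id
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Observed fraction of edges whose endpoints share a cluster.
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m

    # Expected fraction under a degree-preserving random rewiring.
    cluster_degree = {}
    for node, k in degree.items():
        c = community[node]
        cluster_degree[c] = cluster_degree.get(c, 0) + k
    expected = sum((d / (2 * m)) ** 2 for d in cluster_degree.values())

    return inside - expected


# Two triangles (A, B, C) and (D, E, F) connected by the edge C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"),
         ("C", "D")]
two_clusters = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_cluster = {node: 0 for node in "ABCDEF"}

print(modularity(edges, two_clusters))
print(modularity(edges, one_cluster))
```

Splitting the graph into its two triangles scores higher than keeping everything in one cluster, which is the signal that modularity-based methods such as Louvain or Leiden search for.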
  • 33. Citation Networks - CitNetExplorer • Papers arranged on a timeline • Each colour represents a cluster • Identify the characteristic terms for each cluster Picture from Van Eck, Nees Jan, and Ludo Waltman. "Citation-based clustering of publications using CitNetExplorer and VOSviewer." Scientometrics 111.2 (2017): 1053-1070.
  • 34. Citation Networks - VOSViewer • Each colour represents a co-citation cluster which defines a scientific area Picture from https://www.tudelft.nl/library/actuele-themas/research-analytics/case-12-citation-networks-2/
  • 35. Citation Network • Widely adopted in literature Problem: • Given a cluster one needs to identify the characteristic topic • Research papers defining new topics might need some years to gather citations. • This approach is ineffective for the early detection of new research topics
  • 36. Natural Language Processing - CSO Classifier Uses state-of-the-art technologies to parse documents and recognise research entities. As input, it takes the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the Computer Science Ontology Salatino, Angelo A., et al. "The CSO classifier: Ontology-driven detection of research topics in scholarly articles." International Conference on Theory and Practice of Digital Libraries. Springer, Cham, 2019.
  • 38. Syntactic Module • We split the text into unigrams, bigrams and trigrams • For each n-gram we measure the Levenshtein similarity with the topics in CSO • We select the CSO topics whose similarity with an n-gram is greater than or equal to 0.94 • This helps handle plurals and hyphenated topics, such as: • “knowledge based systems” and “knowledge-based systems” • “database” and “databases”
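The steps above can be sketched in a few lines of plain Python. This is an illustration, not the CSO Classifier's code. The slide only says "Levenshtein similarity", so the normalisation is an assumption: a length-normalised ratio in which a substitution costs 2 (one deletion plus one insertion). With that choice both slide examples pass the 0.94 threshold. The topic list and function names are hypothetical.

```python
def edit_distance(a, b, sub_cost=2):
    """Levenshtein distance with a configurable substitution cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Length-normalised similarity in [0, 1] (assumed normalisation)."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else (total - edit_distance(a, b)) / total


def ngrams(text, max_n=3):
    """Unigrams, bigrams and trigrams of a whitespace-tokenised text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def syntactic_match(text, topics, threshold=0.94):
    """Topics whose similarity with at least one n-gram reaches the threshold."""
    return sorted({topic
                   for gram in ngrams(text)
                   for topic in topics
                   if similarity(gram, topic) >= threshold})


topics = ["knowledge-based systems", "database", "semantic web"]  # hypothetical excerpt
matches = syntactic_match("Knowledge based systems rely on databases", topics)
print(matches)
```

Comparing every n-gram with every topic is quadratic; a real implementation would need some form of candidate filtering, which is outside the scope of this sketch.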
  • 40. Semantic Module Word Embedding model • We used titles and abstracts from 4.5M papers in Computer Science • Pre-processed text: • Topic replacement – “digital libraries” → “digital_libraries” • Collocation analysis – “highest_accuracies”, “highly_cited_journals” • Trained word2vec model with: method = skipgram; embedding size = 128; window size = 10; negative samples = 5; max iterations = 5; min-count cutoff = 10
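The topic replacement step can be illustrated as follows: multi-word topic labels are rewritten with underscores so that the embedding model sees each of them as a single token. This is a sketch under assumptions; the label list, the function name, and the longest-first ordering are illustrative choices, not a description of the original pipeline.

```python
import re


def replace_topics(text, topic_labels):
    """Join multi-word topic labels with underscores, longest labels first."""
    for label in sorted(topic_labels, key=len, reverse=True):
        if " " not in label:
            continue  # single-word labels are already one token
        pattern = r"\b" + re.escape(label) + r"\b"
        joined = label.replace(" ", "_")
        # A function is used as replacement so the label is inserted literally.
        text = re.sub(pattern, lambda _match, j=joined: j, text, flags=re.IGNORECASE)
    return text


labels = ["digital libraries", "semantic web", "semantic web technologies"]  # hypothetical
print(replace_topics("We study digital libraries and Semantic Web technologies.", labels))
```

Processing the longest labels first prevents a shorter label such as "semantic web" from splitting a longer one such as "semantic web technologies".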
  • 41. Word Embedding model “king” = [0.32, 0.76,…] “queen” = [0.42, 0.76,…] “woman” = [0.56, 0.43,…] “man” = [0.59, 0.42,...] king + (woman – man) = queen It locates synonyms (related topics) close to each other in this vector space: high cosine similarity
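A small sketch of the vector arithmetic on this slide, using cosine similarity to find the nearest word. The vectors are toy 2-D vectors that reuse only the two components visible on the slide (the real embeddings have 128 dimensions), and "apple" is a hypothetical unrelated word added so that there is something to compare against. As is common practice for analogies, the three query words are excluded from the candidates.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy vectors: truncated to the two components shown on the slide.
vectors = {
    "king": [0.32, 0.76],
    "queen": [0.42, 0.76],
    "woman": [0.56, 0.43],
    "man": [0.59, 0.42],
    "apple": [0.90, 0.10],  # hypothetical unrelated word
}


def analogy(a, b, c, vectors):
    """Word closest to a + (b - c), ignoring the three query words."""
    target = [x + (y - z) for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


best = analogy("king", "woman", "man", vectors)
print(best)
```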
  • 42. Semantic Module Entity Extraction • POS tagger, and grammar-based chunk parser <JJ.*>*<NN.*>+ “digital libraries” CSO concept identification • Selects all CSO topics found in the top-10 similar words of the resulting n-grams (with cosine similarity > 0.7)
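The chunk grammar <JJ.*>*<NN.*>+ reads as "zero or more adjectives followed by one or more nouns". The sketch below applies that pattern to tokens that are already POS-tagged; the tags in the example are assigned by hand (an assumption), whereas a real pipeline would obtain them from a POS tagger. The function name is hypothetical.

```python
def noun_chunks(tagged_tokens):
    """Extract chunks matching <JJ.*>*<NN.*>+ from (word, tag) pairs."""
    chunks, current, has_noun = [], [], False
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ"):
            if has_noun:
                # An adjective after nouns closes the chunk and starts a new one.
                chunks.append(" ".join(current))
                current, has_noun = [], False
            current.append(word)
        else:
            # Any other tag ends the chunk; adjectives without a noun are dropped.
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
    if has_noun:
        chunks.append(" ".join(current))
    return chunks


tagged = [("We", "PRP"), ("analyse", "VBP"), ("large", "JJ"), ("digital", "JJ"),
          ("libraries", "NNS"), ("with", "IN"), ("machine", "NN"), ("learning", "NN")]
print(noun_chunks(tagged))
```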
  • 43. Semantic Module Concept ranking • We assign a score to each identified topic: • Frequency – number of times it was inferred • Diversity – number of unique text chunks from which it was inferred Concept Selection • Elbow method Example (CSO topic: score): domain ontologies: 40; semantic web: 40; ontology learning: 40; data mining: 40; heterogeneous resources: 24; semantics: 24; world wide web: 10; network architecture: 6; scholarly communication: 6; ontology matching: 6; …
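A minimal sketch of ranking and selection. Two points are assumptions of this sketch: the score is taken as frequency multiplied by diversity (the slide names both factors but not how they are combined), and the elbow is approximated by cutting at the largest drop between consecutive scores, which is a simplified stand-in for a proper elbow or knee detection. The input pairs are hypothetical.

```python
from collections import Counter, defaultdict


def rank_topics(inferences):
    """Score topics from (text_chunk, topic) pairs as frequency * diversity."""
    frequency = Counter(topic for _, topic in inferences)
    chunks = defaultdict(set)
    for chunk, topic in inferences:
        chunks[topic].add(chunk)
    scores = {topic: frequency[topic] * len(chunks[topic]) for topic in frequency}
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def select_by_largest_drop(ranked):
    """Keep the topics that precede the largest drop in score."""
    if len(ranked) < 2:
        return [topic for topic, _ in ranked]
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    if max(drops) == 0:
        return [topic for topic, _ in ranked]  # flat curve: nothing to cut
    cut = drops.index(max(drops)) + 1
    return [topic for topic, _ in ranked[:cut]]


inferences = [  # hypothetical (chunk, inferred topic) pairs
    ("semantic web technologies", "semantic web"),
    ("semantic web", "semantic web"),
    ("web ontologies", "semantic web"),
    ("data mining", "data mining"),
    ("data mining", "data mining"),
    ("network", "network architecture"),
]
ranked = rank_topics(inferences)
print(ranked)
print(select_by_largest_drop(ranked))
```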
  • 45. Post Processing Combination of output Semantic enhancement • We use the superTopicOf relation to enhance the output set • E.g., if “machine learning” is returned, then “artificial intelligence” is added as well • Provides wider context for the analysed paper • Enables analytics on high-level abstract topics (e.g., digital libraries)
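The enhancement step amounts to walking up the superTopicOf hierarchy. In the sketch below, the dictionary from a topic to its direct super topics is a hypothetical excerpt, and the choice between adding only the direct super topics or all ancestors is exposed as a parameter because the slide does not say how far up the hierarchy the enhancement goes.

```python
def enhance(topics, super_topics, all_ancestors=False):
    """Add super topics to a set of topics.

    super_topics:  dict mapping a topic to its direct super topics
    all_ancestors: if True, keep climbing the hierarchy; otherwise add
                   only the direct super topics of the input topics
    """
    enhanced = set(topics)
    frontier = list(topics)
    while frontier:
        topic = frontier.pop()
        for parent in super_topics.get(topic, ()):
            if parent not in enhanced:
                enhanced.add(parent)
                if all_ancestors:
                    frontier.append(parent)
    return enhanced


hierarchy = {  # hypothetical excerpt of superTopicOf, read bottom-up
    "machine learning": ["artificial intelligence"],
    "artificial intelligence": ["computer science"],
}
print(sorted(enhance({"machine learning"}, hierarchy)))
print(sorted(enhance({"machine learning"}, hierarchy, all_ancestors=True)))
```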
  • 47. Metadata Extraction – Springer Nature Use Case • Traditionally, editors choose a list of related keywords and categories in relevant taxonomies according to: • their own experience of similar conferences • a visual exploration of titles and abstracts • a list of terms given by the curators or derived from calls for papers Salatino, A. A., Osborne, F., Birukou, A., & Motta, E. (2019, October). Improving editorial workflow and metadata quality at springer nature. In International Semantic Web Conference (pp. 507-525). Springer, Cham.
  • 48. Classification of Proceedings – A Complex Problem Classifying publications manually presents a number of issues for a large publisher such as Springer Nature. • It is a complex process that requires expert editors • It is a time-consuming process which can hardly scale • It is easy to miss the emergence of new topics • It is easy to assume that some traditional topics are still popular when this is no longer the case • The keywords used in the call for papers are often a reflection of what a venue aspires to be, rather than the real contents of the proceedings.
  • 49. Smart Topic Miner Architecture Demo of STM: http://stm-demo.kmi.open.ac.uk Diagram components: SN Editors; HTML GUI; Parser; Generate Visualizations; STM Engine (i) CSO Classifier, ii) Topic Explanation, iii) Taxonomy Generation, iv) SN Tags Inference, v) Previous Classification); CSO; SNCs; Historical Data; word2vec model
  • 50. Business Value • STM halves the time needed for classifying proceedings from 30 to 15 minutes • It also allows junior editors to work on the classification of proceedings, distributing the load and reducing costs • It achieved an overall 75% cost reduction • The adoption of a controlled vocabulary makes the process more robust and facilitates the identification of related editorial products
  • 52. Recommendation of Books Identifying SN Books to be marketed at specific events, such as academic conferences • Manual book selection has some limitations: • Requires years of experience and domain-specific knowledge • Requires browsing through large catalogue of information - syntactic • Prone to biases • Aim: • Provide a more effective way to support the book selection process • Help drive the operating costs down • Semi-automated selection of the most appropriate books, journals, and proceedings to market at a scientific event • We developed the Smart Book Recommender: http://rexplore.kmi.open.ac.uk/SBR-demo Thanapalasingam, T., Osborne, F., Birukou, A., & Motta, E. (2018, October). Ontology-based recommendation of editorial products. In International Semantic Web Conference (pp. 341-358). Springer, Cham.
  • 53. Workflow (1) Steps: Classifying Conferences & Editorial Products (this slide), Computing Pairwise Similarity, Querying & Visualizing results • We characterised books and conference proceedings through their research topics: • The metadata of chapters/papers (i.e. keywords, title and abstract) are mapped to research topics in CSO • The mapping returns a set of research topics
  • 54. Workflow (2) Steps: Classifying Conferences & Editorial Products, Computing Pairwise Similarity (this slide), Querying & Visualizing results • Computing the cosine similarity of two editorial products using vectors of research topics and their weights
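The pairwise similarity can be sketched with sparse vectors, where each editorial product is a dictionary from research topic to weight. The products, topics, and weights below are hypothetical, and how the weights are derived is not specified on the slide.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse topic -> weight vectors."""
    dot = sum(weight * b[topic] for topic, weight in a.items() if topic in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(target, catalogue):
    """Rank catalogue items by similarity to the target, most similar first."""
    return sorted(catalogue, key=lambda name: cosine(target, catalogue[name]), reverse=True)


# Hypothetical topic vectors for one conference and three books.
conference = {"semantic web": 3.0, "ontology": 4.0}
books = {
    "Book A": {"semantic web": 3.0, "ontology": 4.0},
    "Book B": {"semantic web": 1.0, "computer vision": 5.0},
    "Book C": {"computer vision": 2.0},
}
print(recommend(conference, books))
```

Only topics shared by both products contribute to the dot product, so two products with no topic in common have similarity 0.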
  • 55. Workflow (3) Steps: Classifying Conferences & Editorial Products, Computing Pairwise Similarity, Querying & Visualizing results (this slide) • A web application makes AJAX requests to query a relational database and displays the recommendations • Visualisations are constructed in real-time using D3.js • Interactive analytics for comparing items
  • 56. Advanced Visualisation of a Book's Topics
  • 58. Research Trends Forecast • We created a new approach for predicting the impact of a topic on industry. • It uses four time series: i) publications from academia, ii) publications from industry, iii) patents from academia, and iv) patents from industry. • We tested it on the task of predicting whether an emergent research topic will have a significant impact on industry (> 50 patents) in the following 10 years. • The evaluation supports the hypothesis that considering the four time series separately leads to higher-quality predictions, and suggests that publications from industry (RI) and from academia (RA) are good indicators for patents from industry (PI).
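The prediction task can be framed as a binary classification example per topic, as sketched below. The numbers and the function name are hypothetical. The sketch assumes the label is obtained by summing the industry patents over the following 10 years and comparing the total with the threshold of 50, and it leaves out the classifier itself.

```python
def make_example(signals, future_industry_patents,
                 use=("RA", "RI", "PA", "PI"), threshold=50):
    """Build (features, label) for one research topic.

    signals: dict with yearly counts for the keys RA, RI, PA and PI
             (papers/patents from academia/industry)
    future_industry_patents: yearly industry patent counts for the
             following years (assumed to cover the 10-year horizon)
    use:     which time series to concatenate into the feature vector
    """
    features = [value for name in use for value in signals[name]]
    label = sum(future_industry_patents) > threshold
    return features, label


topic_signals = {  # hypothetical yearly counts for an emergent topic
    "RA": [1, 3, 6],
    "RI": [0, 1, 2],
    "PA": [0, 0, 1],
    "PI": [0, 1, 1],
}
features, label = make_example(topic_signals, [6] * 10)
print(features, label)
```

The `use` parameter makes it easy to train on different combinations of the four series and compare them.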
  • 59. Data modelling pipeline Diagram components: Research Papers; Patents; Filtering documents (for both sources); CSO Classifier, giving a fine-grained representation of research topics from the Computer Science Ontology; Extraction of affiliation types. Peoples' Friendship University of Russia Salatino, A., Osborne, F., & Motta, E. (2020, September). Researchflow: Understanding the knowledge flow between academia and industry. In International Conference on Knowledge Engineering and Knowledge Management (pp. 219-236). Springer, Cham.
  • 60. Scholarly Data++ Example paper annotated with the types of scholarly data: • Text (Title, Abstract, Keywords): title “Improving Editorial Workflow and Metadata Quality at Springer Nature”; abstract “Identifying the research topics that best describe the scope of a scientific publication is a crucial task for editors, in particular because the quality of these annotations determine how effectively users are able to discover the right content in online libraries. For this reason, Springer Nature, the world’s largest academic book publisher, has traditionally entrusted this task to their most expert editors. These editors manually analyse all new books, possibly including hundreds of chapters, and produce a list of the most relevant topics. Hence, this process has traditionally been very expensive, time-consuming, and confined to a few senior editors. For these reasons, back in 2016 we developed Smart Topic Miner (STM), an ontology-driven application that assists the Springer Nature editorial team in annotating the volumes of all books covering conference proceedings in Computer Science. Since then STM has been regularly used by editors in Germany, China, Brazil, India, and Japan, …” • Authors: Angelo Salatino, Francesco Osborne, Aliaksandr Birukou, Enrico Motta • Affiliations: The Open University, Springer Nature • Conference/Journal: The 18th International Semantic Web Conference (ISWC 2019) • Citations • References • Topics: scholarly data, semantic web, data mining, ontology, digital libraries, … • Affiliation Types: Academia, Industry • Keywords: Scholarly data, Bibliographic metadata, Topic classification,
  • 61. Research Topic Each research topic is represented through 4 signals: Papers from Academia (RA) Papers from Industry (RI) Patents from Academia (PA) Patents from Industry (PI)
  • 62. Machine Learning approach We used: • Logistic Regression (LR) • Random Forest (RF) • AdaBoost (AB) • Convolutional Neural Network (CNN) • Long Short-term Memory Neural Network (LSTM) On several combinations of time-series: RA, RI, PA and PI
  • 65. Conference Dashboard Angioni, Simone, et al. "The AIDA Dashboard: Analysing Conferences with Semantic Technologies."
  • 69. Conclusions • We have seen how effective the Computer Science Ontology Framework is • It enables us to combine machine learning algorithms and semantic technologies to produce high-level applications for gaining insights in the field of Science of Science • Future work: applying this framework in other domains of Science Framework layers: Corpus of Research Papers, Klink-2 Algorithm, Computer Science Ontology, CSO Classifier, High-level Applications