Untangling the State of the Art in Artificial Intelligence by Applying Text Mining and Network Analysis

Artificial intelligence is being applied in a wide variety of areas and has generated a lot of attention over the past few years. Keeping up to date with the advances in this broad field is challenging because of the sheer volume of scientific output. This blog post provides an interactive overview of the most impactful recent AI research through a combination of text mining and network analysis.

Data and Search Strategy

The data for this analysis was collected using The Lens, a free and open website that serves and integrates scholarly and patent data. The search strategy was straightforward: we selected scientific publications with a publication date in 2018 whose titles mention “artificial intelligence”, its subdomains “machine learning” or “deep learning”, or the term “neural network”. The initial dataset consisted of 7139 documents. To focus on the most impactful articles, the data was then filtered to include only articles with 2 or more citations, which left a final dataset of 3676 articles.
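As a rough illustration of the selection criteria described above, the filter can be sketched in a few lines of Python. This is not the actual query run on The Lens: the record fields (title, year, citations) and the sample records are made up for the example, and the case-insensitive substring match on titles is an assumption.

```python
import re

# Title terms named in the search strategy; how The Lens matches them is not
# specified here, so a case-insensitive substring match is assumed.
TITLE_TERMS = (
    "artificial intelligence",
    "machine learning",
    "deep learning",
    "neural network",
)
TITLE_PATTERN = re.compile("|".join(re.escape(t) for t in TITLE_TERMS), re.IGNORECASE)


def select_articles(records, year=2018, min_citations=2):
    """Keep records whose title mentions a search term, published in `year`,
    with at least `min_citations` citations."""
    return [
        r for r in records
        if TITLE_PATTERN.search(r["title"])
        and r["year"] == year
        and r["citations"] >= min_citations
    ]


# Made-up records with hypothetical field names, for illustration only.
sample = [
    {"title": "Deep Learning for Wind Speed Prediction", "year": 2018, "citations": 5},
    {"title": "A Survey of Graph Databases", "year": 2018, "citations": 9},
    {"title": "Neural Networks in Chemistry", "year": 2018, "citations": 1},
    {"title": "Machine Learning for Streamflow Forecasting", "year": 2017, "citations": 12},
]
selected = select_articles(sample)
```

Only the first sample record passes all three conditions: the second has no matching title term, the third has too few citations, and the fourth was published in the wrong year.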

The number of citations received by an article can be seen as a key indicator of its overall impact. All articles in the filtered dataset have gathered citations already, despite their relatively short publication lifespans. In this analysis, we see these articles as constituting the ‘state of the art’ in AI research based on the impact they’ve had within a short time frame.

Identifying Research Clusters

Groups or clusters of related documents were identified using a combination of text analysis and network analysis. The key assumption is that documents can be linked when their abstracts share similar content. TF-IDF (term frequency-inverse document frequency) models are a common text analysis technique for assigning numerical vectors to documents (here, their abstracts). These vectors can then be used to calculate pairwise similarities between all documents in the dataset, which was the first step of the analysis.
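To make the TF-IDF step more concrete, here is a minimal sketch using only the Python standard library. It is not the implementation used for this analysis: the tokenizer, the exact weighting formula (relative term frequency times log(N / document frequency)) and the three sample abstracts are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a term -> weight dict) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in doc.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up abstracts, for illustration only.
abstracts = [
    "quantum machine learning for molecular properties",
    "quantum algorithms for atomic and molecular properties",
    "deep neural network for cancer image classification",
]
vectors = tfidf_vectors(abstracts)
similarities = {
    (i, j): cosine(vectors[i], vectors[j])
    for i, j in combinations(range(len(vectors)), 2)
}
```

In this toy example the first two abstracts share several informative terms and receive a positive similarity. The third shares only the word “for” with the others, which appears in every abstract and therefore gets a weight of zero.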

After obtaining similarity scores for all pairs of documents, the next step was to display the results as an interactive network so the dataset could be explored. The networks were created using Kenelyze, Kenedict’s network visualization platform. Links were initially drawn between documents when their cosine similarity exceeded 0.1. Clusters were then identified using a community detection algorithm and used to color the nodes in the network. The size of each node reflects the number of citations the document has received from other documents.
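The thresholding and clustering steps can be sketched as follows. The 0.1 threshold comes from the analysis above; everything else is an assumption. In particular, the community detection algorithm used in Kenelyze is not specified in this post, so a simple weighted label propagation stands in for it here, and the similarity scores are made up.

```python
from collections import defaultdict


def build_network(similarities, threshold=0.1):
    """Keep only document pairs whose similarity exceeds the threshold."""
    neighbors = defaultdict(dict)
    for (a, b), score in similarities.items():
        if score > threshold:
            neighbors[a][b] = score
            neighbors[b][a] = score
    return neighbors


def label_propagation(nodes, neighbors, max_rounds=100):
    """Stand-in community detection: each node repeatedly adopts the label
    with the largest total link weight among its neighbors. Ties go to the
    smallest label so that the result is deterministic."""
    labels = {node: node for node in nodes}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(nodes):
            scores = defaultdict(float)
            for other, weight in neighbors.get(node, {}).items():
                scores[labels[other]] += weight
            if not scores:
                continue  # isolated node keeps its own label
            best = min(scores, key=lambda label: (-scores[label], label))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Made-up similarity scores: two tight groups, one weak bridge (2-3) that
# stays above the threshold, and one pair (0-5) that falls below it.
example_similarities = {
    (0, 1): 0.5, (0, 2): 0.5, (1, 2): 0.5,
    (3, 4): 0.5, (3, 5): 0.5, (4, 5): 0.5,
    (2, 3): 0.12,
    (0, 5): 0.05,
}
network = build_network(example_similarities)
clusters = label_propagation(range(6), network)
```

Documents 0 to 2 end up with one label and documents 3 to 5 with another, even though the weak link between documents 2 and 3 is kept in the network. The resulting labels play the role of the node colors in the visualization.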

Exploring and Annotating the Network

The graphic above shows the results of the analysis. Clusters of related content were labeled based on manual examination of the documents. As expected, we can see a wide variety of themes and application areas. The left-hand side of the visual shows output relating to advances in various types of neural networks and model optimization, while more practical applications in health care, chemistry and biology can be found on the right-hand side of the network. ‘Classic’ AI tasks such as image and video classification can be found in the clusters at the bottom center of the graphic.

Emerging research areas such as Quantum Machine Learning are represented as well, with various connections to a cluster on molecular/atomic properties.

2018 also saw significant energy-related output, varying from documents on wind speed prediction and streamflow forecasting to general energy forecasting models.

Detection and classification of various types of cancer is an active research area as well.

Interactively Exploring the Visualization

Network visualizations are an excellent way to explore a dataset. The interactive visual below allows you to search for and zoom in on individual documents, or to filter by keywords of interest. For example, try typing ‘quantum’ in the ‘Filter network’ box to see where documents relating to quantum machine learning are located. Clicking a document shows its properties in a panel and allows you to read its abstract. You can find a full-screen version of the visual here.

Combining Network Analysis and Text Mining for New Insights

Mapping science or IP output in network visualizations is often based on the use of readily available metadata such as citations, author keywords, or listed authors and affiliations. Text mining of documents, as carried out for this blog post, can provide a valuable additional perspective on published output in an area of interest. Grouping output based on shared content/language allows for quick identification of key themes in a dataset and is an excellent way to explore trends and connections between clusters.

If you have any questions about this analysis or would like to apply a similar approach to your own area of interest, please let me know: andre.vermeij@kenedict.com.

By André Vermeij on 09/04/2019

