PubMed Knowledge Graph Datasets

Dataset Name PKG2020S4 (1781-Dec. 2020), Version 4
Description PKG2020S4 (1781-Dec. 2020), the new version of PKG, updates the previous version with the PubMed 2021 baseline files and the PubMed daily update files (up to Jan. 4th, 2021). It also includes extracted bio-entities, author disambiguation results, extended author information, Scimago journal information, and WOS citations, which contain the reference relations between each PMID and its reference PMIDs as extracted from WOS.

Database Features: 1-PKG2020S4 (1781-Dec. 2020) Features.pdf
Database Description: 2-PKG2020S4 (1781-Dec. 2020) Database Description.pdf

Data and Schema Visualizations
  1. Schema Visualization

Dataset Merge Instructions:
  1. Download all 27 compressed files, together with their MD5 files, from the folder "PKG2020S4_MySQL".
  2. Verify each compressed file against its corresponding MD5 file. For example, the following command checks that the file "PKG2020S4_A01_Articles.sql.gz" was downloaded without damage:
    md5sum -c PKG2020S4_A01_Articles.sql.gz.md5sum
  3. Create a new database on your MySQL server, and make sure its character set ("Charset") is utf8mb4 and its collation ("Order rule") is utf8mb4_bin.
  4. Next, import each table into the target database with a command like:
    gunzip < PKG2020S4_A01_Articles.sql.gz | mysql -uusername -ppassword destinationDatabaseName
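
Steps 2 to 4 can be repeated for all 27 files with a small shell script. The following is a minimal sketch, not an official tool. It assumes that every archive and its .md5sum file sit in one directory, that MySQL credentials are stored in the [client] section of ~/.my.cnf (so no password appears on the command line), and that DB_NAME is a placeholder for your destination database. The function names are illustrative.

```shell
#!/bin/sh
# Sketch only. Assumptions not stated on the dataset page:
#  - every archive and its .md5sum file sit in one directory;
#  - MySQL credentials live in the [client] section of ~/.my.cnf;
#  - DB_NAME is a placeholder for your destination database.
DB_NAME="${DB_NAME:-destinationDatabaseName}"

# Step 2: check every archive against its checksum file.
# Runs in a subshell so the caller's working directory is unchanged.
verify_archives() (
  cd "${1:-.}" || exit 1
  for sum in *.sql.gz.md5sum; do
    [ -e "$sum" ] || { echo "no .md5sum files found" >&2; exit 1; }
    md5sum -c "$sum" || exit 1
  done
)

# Step 3: create the database with the required charset and collation.
create_database() {
  mysql -e "CREATE DATABASE IF NOT EXISTS \`$DB_NAME\` CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;"
}

# Step 4: import every archive, stopping if mysql reports an error.
import_archives() {
  for archive in "${1:-.}"/*.sql.gz; do
    [ -e "$archive" ] || { echo "no .sql.gz files found" >&2; return 1; }
    echo "Importing $archive"
    gunzip < "$archive" | mysql "$DB_NAME" || return 1
  done
}
```

After loading these definitions into a shell, the three steps can be chained so that the import only starts once every checksum has passed:

    verify_archives . && create_database && import_archives .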
Download URLs MD5Sum Description
A01_Articles MD5Sum Specific information for each article
A02_AuthorList MD5Sum Specific information for each author
A03_KeywordList MD5Sum Article keyword information; the keywords in this table are provided by the data producer
A04_Abstract MD5Sum The abstract of each article
A05_GrantList MD5Sum Grant details of each article
A06_MeshHeadingList MD5Sum MeSH heading details of each article. MeSH headings come from the NLM controlled vocabulary, Medical Subject Headings (MeSH®), which is used to characterize the content of the articles represented by MEDLINE citations.
A07_SupplMeshList MD5Sum Supplementary conceptual terms and protocol terms for each article
A08_ChemicalList MD5Sum The chemical substances and registry number covered in each article. Registry Number refers to a code assigned by Chemical Abstracts Service to a specific chemical substance.
A09_CommentsCorrectionsList MD5Sum Reference information for each article, including the source, type, and PMID of the reference
A10_DataBankList MD5Sum The accession numbers of molecular sequence databases that appear in each PubMed article. An accession number retrieves the corresponding molecule's record from the established molecular sequence database, which avoids the use of lengthy molecular formulas and graphics in the article.
A11_PersonalNameSubjectList MD5Sum Specific information for each author (this table is not as complete as A02)
A12_InvestigatorList MD5Sum NASA-funded principal investigator (PI) information for each article. These investigators participated in the discussion and research behind the article but are not necessarily its authors.
A13_AffiliationList MD5Sum Extracted affiliation information
A14_ReferenceList MD5Sum Reference information (we use this table to generate authors' self-citation records, which serve as a strong feature during author disambiguation)
B01_Descriptor MD5Sum Specific information of Descriptor Name of each article (used for the classification of articles)
B02_projectlist_NIH MD5Sum Contains NIH funding information
B03_map_PMID_ProjID MD5Sum Mapping between funded projects and the PMIDs of the funded articles, as published on the NIH official website.
B07_ORCID_Main MD5Sum PubMed related author information extracted from ORCID
B08_ORCID_Education MD5Sum Education information of scientific personnel from the ORCID dataset
B09_ORCID_Employment MD5Sum Employment information of scientific personnel from the ORCID dataset
B10_BERN_Main MD5Sum Entity information set extracted from document titles and abstracts using BioBERT
B11_BERN_EntityType MD5Sum Entity Type dictionary
B12_BERN_Mutation MD5Sum Mutation information set extracted from literature titles and abstracts using BioBERT
B14_Scimago MD5Sum Data Set of Bibliometric Indicators for Ranking Journals Based on Citation Source Information
C03_Affiliation_merge MD5Sum Merges the B05_Vetle_Map and A13_AffiliationList information, including the parsed organization information, such as Zip code, Location, Country, etc.
C04_ReferenceList MD5Sum Contains 633401975 citations from 23856949 articles. The integrated data sources are PubMed's own citation data, the NIH Open Citation Collection, OpenCitations (run by David Shotton and Silvio Peroni), and the citation data from WOS. Compared with PubMed's own citation data (223261597 citations), the table grew by 410140378. Compared with the previous version of PKG and its WOS citation data (447596685 citations), it grew by 185805290.
C05_NIH_PubMed MD5Sum Matches each author AID with the NIH Principal Investigator (PI) number PIID, using the PMID and the author's name (full last name and initials) from A02 and B02, to generate the correspondence table C05, which includes PIID, AID, CORE_PROJECT_NUMBER, etc.
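
The citation counts quoted for C04_ReferenceList can be cross-checked with plain shell arithmetic. This is only a sanity check on the figures given in that row; the variable names are illustrative and not part of the dataset.

```shell
# Figures quoted in the C04_ReferenceList description.
total=633401975        # citations in C04_ReferenceList
pubmed_only=223261597  # PubMed's own citation data
baseline=447596685     # count quoted in the comparison with the previous PKG version

gain_over_pubmed=$(( total - pubmed_only ))
gain_over_baseline=$(( total - baseline ))
echo "$gain_over_pubmed"    # 410140378
echo "$gain_over_baseline"  # 185805290
```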
URL (CSV) CSV Description: Description About CSV Files.pdf

We built PKG with bio-entities extracted from PubMed abstracts, author name disambiguation results for PubMed authors, and integrated multi-source information. (PubMed raw data are not included in the CSV files.)
Download URLs Description
CSV file containing PubMed authors and AND_IDs.
CSV file containing all types of bio-entities extracted by BioBERT.
CSV file containing additional items of mutations from the Bio-entities_Main file.
CSV file containing affiliations and their extracted fine-grained items.
CSV file containing employment history from ORCID.
CSV file containing educational background from ORCID.
CSV file containing projects from NIH ExPORTER and the mapping relation between PI_ID, PMID, and AND_ID.

Link to older versions of this dataset