PubMedBridge User Guide

1. Overview

PubMedBridge is a suite of web-based tools that streamlines the preprocessing of PubMed data for bibliometric analysis. It provides a user-friendly interface to parse raw PubMed exports, convert them into structured formats, and prepare them for visualization and analysis software.

Key Features

User-Selected Metadata Fields – Choose from 34 metadata options grouped into core citation, author information, affiliations, country, publication content, publication details, and links
Defensive Resolution – Prioritizes precision over coverage, flagging ambiguous cases rather than forcing uncertain assignments
Transparent Output – All country resolutions are exported with their resolution method for verification and manual correction
Spreadsheet-Centric Workflow – Intermediate results exported as .xlsx for easy inspection and correction
Client-Side Processing – All operations execute locally in your browser, ensuring complete data privacy
Open Source – Fully transparent codebase available for inspection

2. System Architecture and Workflow

PubMedBridge is architected around two integrated modules that work together in a human-in-the-loop curation workflow, where users verify and, when necessary, manually correct country assignments.

Step 1: Input & Automated Resolution
Upload PubMed .txt file → PubMed2XLSX processes → Export .xlsx spreadsheet

Step 2: Human Validation
Review in Excel/Sheets → Verify countries → Filter & curate data

Step 3: Output Generation
Upload curated .xlsx → XLSX2PubMed converts → Export PubMed .txt

Workflow Details

1. Input & Automated Resolution

Users upload a raw PubMed data file (.txt format) to the PubMed2XLSX tool and select metadata fields to be included in the output. The tool automatically parses the file, applies the country resolution algorithm to affiliation strings, and exports the results as a structured spreadsheet (.xlsx).

Key outputs:

Structured tabular data with user-selected metadata fields
Automated country assignments with resolution method labels
Flagged ambiguous or unresolved cases for manual review
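To give a feel for what parsing a PubMed-format file involves, here is a minimal sketch in Python. It is not PubMedBridge's own parser, which runs in the browser. It assumes the usual PubMed export layout: a tag of up to four characters, a hyphen in the fifth column, continuation lines indented by six spaces, and a blank line between records. The function name `parse_pubmed_text` and the sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical two-record sample in PubMed export layout.
SAMPLE = """\
PMID- 12345678
TI  - A hypothetical title that wraps
      onto a second line.
FAU - Doe, Jane
AD  - Department of Medicine, Example University, Yangon, Myanmar.
FAU - Roe, Richard
AD  - Example Institute, Atlanta, Georgia, USA.

PMID- 87654321
TI  - Another hypothetical record.
"""


def parse_pubmed_text(text):
    """Parse PubMed-format text into a list of {tag: [values]} dicts."""
    records = []
    current = None
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(dict(current))
            current, last_tag = None, None
            continue
        if len(line) >= 5 and line[4] == "-" and line[:4].strip():
            # Tagged line, e.g. "PMID- 12345678" or "TI  - Some title".
            if current is None:
                current = defaultdict(list)
            last_tag = line[:4].strip()
            current[last_tag].append(line[6:].strip())
        elif line.startswith("      ") and current is not None and last_tag:
            # Continuation line: extend the most recent value.
            current[last_tag][-1] += " " + line.strip()
    if current:
        records.append(dict(current))
    return records


records = parse_pubmed_text(SAMPLE)
print(len(records), records[0]["TI"])
```

Repeated tags such as FAU and AD are kept as lists in file order, which is what makes it possible to treat each affiliation string separately.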

2. Human-in-the-Loop Validation & Curation

The exported spreadsheet serves as an auditable dataset that users can review and refine using standard spreadsheet software. This critical step integrates domain expertise where algorithmic resolution is uncertain.

Two types of curation:

Country Assignment Verification: Review records and manually correct or confirm country assignments based on contextual knowledge
Dataset Filtering: Apply filters based on metadata fields to construct tailored datasets:
- Remove ineligible records based on content of other fields (e.g., Abstract)
- Exclude publications outside specified date ranges
- Filter by author criteria or publication types
- Apply custom inclusion/exclusion criteria
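Users who prefer scripting to spreadsheet filters could apply the same kinds of criteria in code. The sketch below is one possible approach, not part of PubMedBridge. It assumes the rows have already been loaded as dictionaries keyed by column header (for example from a CSV export of the sheet), and that the columns are named "Publication Year", "Publication Type", and "Abstract" as in the field list. The function `keep_row` and the sample rows are hypothetical.

```python
def keep_row(row, year_from, year_to, wanted_type, abstract_terms):
    """Return True if a row passes every inclusion criterion."""
    try:
        year = int(row["Publication Year"])
    except (KeyError, TypeError, ValueError):
        # Missing or non-numeric year: exclude here, or route to manual review.
        return False
    if not year_from <= year <= year_to:
        return False
    if wanted_type.lower() not in str(row.get("Publication Type", "")).lower():
        return False
    abstract = str(row.get("Abstract", "")).lower()
    return any(term.lower() in abstract for term in abstract_terms)


# Hypothetical rows standing in for the exported spreadsheet.
rows = [
    {"PMID": "1", "Publication Year": "2019", "Publication Type": "Journal Article",
     "Abstract": "A cohort study of malaria incidence."},
    {"PMID": "2", "Publication Year": "2012", "Publication Type": "Journal Article",
     "Abstract": "Malaria vector control."},
    {"PMID": "3", "Publication Year": "2021", "Publication Type": "Review",
     "Abstract": "Malaria treatment overview."},
    {"PMID": "4", "Publication Year": "2020", "Publication Type": "Journal Article",
     "Abstract": "Dengue surveillance."},
]

kept = [r for r in rows if keep_row(r, 2015, 2023, "Journal Article", ["malaria"])]
print([r["PMID"] for r in kept])
```

Whichever route is used, the filtered rows still need to be saved as .xlsx with the original headers before conversion.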

After validation, users can perform preliminary analyses directly on the spreadsheet or proceed to Step 3.

3. Output Generation

The XLSX2PubMed tool converts the curated spreadsheet back into PubMed format (.txt), ensuring compatibility with bibliometric analysis and visualization software such as VOSviewer, CiteSpace, and Bibliometrix.

Benefits:

Seamless integration with existing bibliometric workflows
Use of specialized visualization tools that require PubMed format
Sharing of curated datasets in standardized format
Closed-loop workflow maintaining original structure
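As a rough illustration of what "converting back to PubMed format" means at the line level, the sketch below writes tag/value pairs in the tagged layout with indented continuation lines. It is only a sketch. The actual mapping from spreadsheet columns to PubMed tags is handled by XLSX2PubMed and is not shown here. The 80-character line width and the names `format_field` and `format_record` are assumptions.

```python
import textwrap


def format_field(tag, value, width=80):
    """Format one field as 'TAG - value', wrapping long values."""
    prefix = f"{tag:<4}- "  # Tag padded to four characters, then "- ".
    wrapped = textwrap.wrap(value, width=width - len(prefix)) or [""]
    lines = [prefix + wrapped[0]]
    # Continuation lines are indented by six spaces.
    lines += ["      " + part for part in wrapped[1:]]
    return "\n".join(lines)


def format_record(fields):
    """Format a list of (tag, value) pairs as one PubMed-style record."""
    return "\n".join(format_field(tag, value) for tag, value in fields)


# Hypothetical record.
record = [
    ("PMID", "12345678"),
    ("TI", "A hypothetical title " * 6),
    ("AD", "Example University, Yangon, Myanmar."),
]
print(format_record(record))
```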

3. Getting Started

3.1 Accessing PubMedBridge

No installation is required. Open pubmedbridge.drmyo.com in a web browser, then:

Click "Launch the Tool" for either PubMed2XLSX or XLSX2PubMed
Begin processing your files

3.2 System Requirements

Modern web browser (Chrome, Firefox, Safari, or Edge recommended)
JavaScript enabled
Spreadsheet software for Step 2 curation (Excel, Google Sheets, LibreOffice Calc)

💡 Privacy Note: All processing occurs locally in your browser. Your data never leaves your computer, ensuring complete privacy and security.

4. Using PubMed2XLSX

4.1 Purpose

PubMed2XLSX parses the metadata in a PubMed export, resolves country names from affiliation strings, and converts the result into structured XLSX and JSON files. It is suited to data validation, filtering, and performance analysis.

4.2 Step-by-Step Instructions

Step 1: Prepare Your PubMed Data

Conduct your search on PubMed (pubmed.ncbi.nlm.nih.gov)
Click "Save" → "Save citations to file"
Save the file in "PubMed" format with .txt extension

Step 2: Upload File

Navigate to PubMed2XLSX tool
Click "Choose File" or drag and drop your .txt file
The tool will automatically detect the file format

Step 3: Select Metadata Fields

Choose which metadata fields to include in your output. Fields are organized into categories:

Category	Fields
Core Citation	PMID, Title, Journal, Journal Abbreviation, Publication Year, Volume, Pages
Author Information	First Author, Last Author, Co-Authors
Country	All Countries, First Author Country, Last Author Country, Co-Author Countries, Non-first Author Countries
Affiliations	All Affiliations, First Author Affiliation, Last Author Affiliation, Co-Author Affiliations, Non-first Author Affiliations
Publication Content	Abstract, Keywords, MeSH Terms, Major MeSH Terms, Publication Type, Country of Publication, Language
Publication Details	ISSN, PMCID, Secondary Source ID, Grant Numbers
Links	DOI, PubMed Link

💡 Tip: Select all fields initially. You can always filter or hide columns later in Excel.

Step 4: Process the File

Click "Process"
Wait for the processing to complete (progress bar will show status)
Large files may take several minutes

Step 5: Download Results

Two files will be generated:

.xlsx file – Structured spreadsheet for review and curation
.json file – Machine-readable format for advanced users

5. Data Curation and Validation

5.1 Opening the Spreadsheet

Open the generated .xlsx file in your preferred spreadsheet software:

Microsoft Excel
Google Sheets
LibreOffice Calc
Apple Numbers

💡 Best Practice (Improving Readability): To make manual review easier, format the sheet as a table with wrapped text. In Excel, for example:

Select your data range (including headers)
Go to Insert → Table (or press Ctrl+T / Cmd+T)
Confirm the range and check "My table has headers"
Select the entire table
Go to Format Cells → Alignment
Enable "Wrap Text"
Adjust row heights as needed

This converts your data to a Table format (with filter dropdowns) and wraps text in cells. Fields like Country, Affiliations, and Authors will now display line-by-line, making it much easier to review multi-value entries and identify issues.

⚠️ Important: Do not modify the column headers in the XLSX file. These are required for the XLSX2PubMed tool to function correctly.

5.2 Understanding Country Resolution Methods

Each record includes a "Country Resolution Method" field that indicates how the country was determined:

Method	Description	Confidence
Direct Match	Country name identified.	High
alpha3	Country alpha3 code identified.	High
US State Name	US state name identified.	High
US State Abbreviation	US state abbreviation identified.	High
Institution Name	Institution name matched in reference institution database.	Low
Institution City	Institution city matched in reference institution database.	Low
USGeorgiaToCheck	Cannot disambiguate between Georgia (US State) and Georgia (Country). Country could not be determined.	No
Institution Name Confusion	Identical institution name in more than one country. Country could not be determined.	No
Institution City Confusion	Identical institution city in more than one country. Country could not be determined.	No
UNRESOLVED	Country could not be determined.	No
Contribution Note	Not an affiliation string.	-
Filtered String	Not an affiliation string.	-
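The idea behind these labels (assign a country only when the evidence is unambiguous, otherwise flag the case) can be illustrated with a toy example. This is emphatically not PubMedBridge's resolution algorithm. It is a simplified sketch that looks only at the last comma-separated part of an affiliation, uses a three-entry hypothetical lookup table, and reuses three of the method labels from the table above.

```python
# Hypothetical, deliberately tiny lookup table.
COUNTRY_NAMES = {"myanmar": "Myanmar", "japan": "Japan", "usa": "United States"}


def resolve_country(affiliation):
    """Return (country, method); country is None when the case is flagged."""
    parts = [p.strip(" .").lower() for p in affiliation.split(",") if p.strip(" .")]
    if not parts:
        return None, "UNRESOLVED"
    last = parts[-1]
    if last in COUNTRY_NAMES:
        return COUNTRY_NAMES[last], "Direct Match"
    if last == "georgia":
        # Could be the US state or the country: flag it instead of guessing.
        return None, "USGeorgiaToCheck"
    return None, "UNRESOLVED"


for text in [
    "Example University, Yangon, Myanmar.",
    "Example Institute, Tbilisi, Georgia",
    "Example Institute, Atlanta, Georgia, USA.",
    "Department of Medicine",
]:
    print(text, "->", resolve_country(text))
```

The second example shows why a label such as USGeorgiaToCheck exists: the string alone does not say which Georgia is meant, so the decision is left to the reviewer.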

5.3 Verifying Country Assignments

Priority Order

UNRESOLVED
Institution City Confusion and Institution Name Confusion
USGeorgiaToCheck
Institution City and Institution Name
US State Abbreviation and US State Name
alpha3
Direct Match
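The priority order above can also be applied programmatically, for instance to sort rows before review. The sketch below is one way to do that, assuming each row carries a single label in a "Country Resolution Method" column. Check the exact label spellings against your own export before relying on it. The names `REVIEW_RANK` and `review_order` are hypothetical.

```python
# Lower rank = review first. Labels sharing a line in the priority list share a rank.
REVIEW_RANK = {
    "UNRESOLVED": 0,
    "Institution City Confusion": 1,
    "Institution Name Confusion": 1,
    "USGeorgiaToCheck": 2,
    "Institution City": 3,
    "Institution Name": 3,
    "US State Abbreviation": 4,
    "US State Name": 4,
    "alpha3": 5,
    "Direct Match": 6,
}


def review_order(rows):
    """Sort rows so the least certain resolutions come first."""
    last = len(REVIEW_RANK) + 1  # Unknown labels sort to the end.
    return sorted(rows, key=lambda r: REVIEW_RANK.get(r.get("Country Resolution Method"), last))


# Hypothetical rows.
rows = [
    {"PMID": "1", "Country Resolution Method": "Direct Match"},
    {"PMID": "2", "Country Resolution Method": "UNRESOLVED"},
    {"PMID": "3", "Country Resolution Method": "Institution Name"},
    {"PMID": "4", "Country Resolution Method": "USGeorgiaToCheck"},
]
print([r["PMID"] for r in review_order(rows)])
```

Because Python's sort is stable, rows with the same rank keep their original spreadsheet order.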

Manual Correction Process

Review the affiliation string
Identify the correct country using contextual information
Enter the country name in the Country column
Optionally update the Country Resolution Method to "Manual Correction"

💡 Best Practice: Use Excel's filter and sort features to group similar cases together for efficient batch review.

5.4 Dataset Filtering and Refinement

Common Filtering Scenarios

Date Range: Filter by Publication Year to include only publications within your study period
Publication Type: Filter by Publication Type to include only original research articles
Language: Filter by Language if needed for your analysis
Abstract Content: Search within abstracts to identify relevant studies
Author Criteria: Filter by author names or affiliations

💡 Best Practice: Use PubMed's native filters for broad criteria (date, language, type) to reduce initial dataset size. Use PubMedBridge filtering for nuanced refinements based on country assignments, affiliation details, or complex content analysis after you've reviewed the data.

⚠️ Remember: Save your curated file before proceeding to XLSX2PubMed. Keep the original column headers unchanged.

6. Using XLSX2PubMed

6.1 Purpose

XLSX2PubMed takes a structured .xlsx file (generated by PubMed2XLSX and then curated) and converts it back into standard PubMed format. This enables use with bibliometric analysis and visualization software.

6.2 Step-by-Step Instructions

Step 1: Prepare Your Curated File

Ensure all manual corrections are complete
Verify column headers remain unchanged
Save your .xlsx file

Step 2: Upload to XLSX2PubMed

Navigate to the XLSX2PubMed tool
Click "Choose File" or drag and drop your .xlsx file
The tool will validate the file format and convert the file to PubMed .txt format
Wait for processing to complete
Progress indicator will show conversion status

Step 3: Download Outputs

Two files will be generated:

.txt file – Standard PubMed format ready for analysis software
.json file – Structured data format for advanced applications

6.3 Using Your Output Files

The PubMed .txt file is ready to use with any bibliometric analysis software that accepts standard PubMed format, including VOSviewer, CiteSpace, Bibliometrix, and others. Refer to your analysis software's documentation for specific import instructions.

7. Troubleshooting

File Upload Fails

Problem: File won't upload or shows error message

Solutions:

Verify file is in correct format (.txt for PubMed2XLSX, .xlsx for XLSX2PubMed)
Check file isn't corrupted or empty
Try a different browser
Ensure JavaScript is enabled

Processing Takes Too Long

Problem: Tool appears frozen or processing doesn't complete

Solutions:

Large files (10,000+ records) may take several minutes
Check browser console for errors (F12 key)
Try splitting large files into smaller batches
Ensure sufficient RAM is available

Security Check Error with PMID

Problem: Processing fails with error referencing specific PMID

Solutions:

Locate the PMID mentioned in the error in your .txt file
Check that record for HTML tags, special characters, or unusual patterns
Common issues: <script> tags, excessive character repetition, HTML-like syntax
Edit the problematic text (typically in abstract or title)
Remove or replace the triggering content
Re-upload the modified file

Note: Security checks prevent harmful content injection. Technical abstracts (e.g., bioinformatics, computer science) may occasionally trigger false positives and require manual editing.
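To locate likely triggers before uploading, a quick pre-scan of the text can help. The sketch below is a guess at the kind of patterns involved, based only on the issues listed above. The regular expressions, the 50-character repetition threshold, and the function name `find_triggers` are assumptions and do not reproduce PubMedBridge's actual security checks.

```python
import re

# Assumed patterns: anything shaped like an HTML tag, and 50+ repeats of one character.
HTML_TAG = re.compile(r"<\s*/?\s*[A-Za-z][^>]*>")
REPEATED_CHAR = re.compile(r"(.)\1{49,}")


def find_triggers(text):
    """Return labels for suspicious patterns found in a piece of text."""
    found = []
    if HTML_TAG.search(text):
        found.append("html-like tag")
    if REPEATED_CHAR.search(text):
        found.append("character repetition")
    return found


print(find_triggers("Expression of <i>TP53</i> in tumours"))
print(find_triggers("A" * 60))
print(find_triggers("Significant at p < 0.05 (n = 120)."))
```

A scan like this will also flag harmless text (for example "a < b and c > d" looks tag-like), so treat hits as places to inspect rather than as errors.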

Long Text Truncated in XLSX

Problem: Very long abstracts or affiliation lists appear cut off in Excel

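Cause: Spreadsheet cell size limits. PubMedBridge caps each cell at 25,000 characters (below Excel's own limit of 32,767 characters per cell) for cross-platform compatibility, so longer text is cut off in the .xlsx output.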

Solutions:

Use the .json output for complete, untruncated text
Look up the PMID directly on PubMed for full content
Most abstracts are under this limit; truncation is rare

Note: For text analysis or NLP work, always use the JSON file which contains complete content.
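If you want to know which records may have been affected, one option is to look for cells whose length has reached the cap. The sketch below assumes rows loaded as dictionaries keyed by column header and assumes that a truncated cell is exactly at (or above) the 25,000-character limit. Whether the tool adds any truncation marker is not covered here. `possibly_truncated` is a hypothetical helper.

```python
CELL_LIMIT = 25_000  # Per-cell cap stated in this guide.


def possibly_truncated(rows, limit=CELL_LIMIT):
    """Return (PMID, column) pairs for cells that have reached the cap."""
    hits = []
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and len(value) >= limit:
                hits.append((row.get("PMID"), column))
    return hits


# Hypothetical rows: the first abstract sits exactly at the cap.
rows = [
    {"PMID": "111", "Title": "Short title", "Abstract": "x" * 25_000},
    {"PMID": "222", "Title": "Another title", "Abstract": "A short abstract."},
]
print(possibly_truncated(rows))
```

For any PMID reported, take the full text from the .json output or from PubMed directly.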

Column Headers Modified Error

Problem: XLSX2PubMed rejects your file

Solutions:

Ensure you haven't renamed any column headers
Check for extra spaces in header names
Verify you're using the original file from PubMed2XLSX
If headers were modified, regenerate from original PubMed file
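When it is not obvious which header changed, comparing the curated header row against the original one pinpoints the problem. The sketch below assumes you have both header rows as lists of strings (for example copied from the first row of each file). `header_problems` is a hypothetical helper, and the validation rules XLSX2PubMed actually applies may differ.

```python
def header_problems(original, curated):
    """List original headers that are missing, renamed, or padded with spaces."""
    problems = []
    curated_set = set(curated)
    stripped = {h.strip(): h for h in curated}
    for name in original:
        if name in curated_set:
            continue  # Header is present and unchanged.
        if name in stripped:
            problems.append(f"extra whitespace in {stripped[name]!r}")
        else:
            problems.append(f"missing or renamed: {name!r}")
    return problems


# Hypothetical header rows.
original = ["PMID", "Title", "All Countries"]
curated = ["PMID ", "Title", "Countries"]
for problem in header_problems(original, curated):
    print(problem)
```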

Downloaded File Won't Open

Problem: XLSX file appears corrupted or won't open

Solutions:

Ensure download completed fully
Try opening with different spreadsheet software
Clear browser cache and regenerate file
Check file size isn't zero bytes

8. Frequently Asked Questions

General Questions

Q: Is my data secure when using PubMedBridge?

A: Yes. All processing occurs entirely within your web browser. No data is uploaded to external servers. Your files remain on your computer throughout the entire workflow.

Q: Do I need to install any software?

A: No installation is required for PubMedBridge itself. You only need a modern web browser. However, you will need spreadsheet software (Excel, Google Sheets, etc.) for the curation step.

Q: What file size limits exist?

A: PubMedBridge has a 100MB file size limit to prevent browser freezing and ensure stable processing. In practice, this is rarely an issue - PubMed's maximum export of 10,000 records per file typically results in files well under this limit. The file size depends on abstract length and metadata completeness.

Q: Can I use this with databases other than PubMed?

A: Currently, PubMedBridge is optimized for PubMed format only. Support for other databases may be added in future versions.

Technical Questions

Q: Why are some affiliations marked as "UNRESOLVED"?

A: The algorithm prioritizes accuracy over completeness. If the country cannot be determined with reasonable confidence, it's flagged for manual review rather than guessed.

Q: Can I add custom country resolution rules?

A: The current version doesn't support custom rules through the interface, but the open-source code can be modified by advanced users.

Q: What database does PubMedBridge use to match institution names and cities?

A: PubMedBridge uses a unified reference dataset combining two major sources:

ROR (Research Organization Registry) v1.74: A community-led registry of 120,196 research organizations with curated institution names, aliases, cities, and countries
OpenAlex: An open catalog of scholarly entities, contributing 115,781 institutions (107,709 with country data) for broader coverage

These datasets are merged and deduplicated to create a comprehensive reference of 120,428 unique institutions worldwide. When the algorithm encounters an institution name or city in an affiliation string, it searches this database to identify the corresponding country.

Why combine both sources? ROR provides high-quality curated data, while OpenAlex offers broader coverage. Together, they maximize the algorithm's ability to correctly resolve affiliations from diverse institutions globally.

Q: How accurate is the automated country resolution?

A: Precision is very high, while recall is deliberately conservative. The algorithm was validated using a stratified random sample of 430 affiliation strings from 9,931 PubMed articles (108,557 total affiliation strings).

Validation results:

High-confidence methods (direct country matches, alpha-3 codes, US state names and abbreviations, covering 96.2% of resolved affiliations): Manual verification of 100 randomly sampled records from each category confirmed all assignments were correct
Overall algorithm performance:
- 99.6% precision: Almost all country assignments made are correct
- 98.7% specificity: Correctly identifies truly ambiguous cases
- 66.4% recall: Successfully resolves about two-thirds of affiliations; the remaining third are flagged for manual review
- F1-score: 0.780: Reflects the balance between high precision and conservative recall

The algorithm uses a conservative, defensive approach - it flags uncertain cases for manual review rather than risk incorrect assignments. This design prioritizes data quality and ensures users can verify all country resolutions.

Workflow Questions

Q: Can I skip the manual curation step?

A: Technically yes, but this defeats the primary purpose of PubMedBridge. Human-in-the-loop validation with auditable output is the essence of the tool.

PubMedBridge is specifically designed around the principle that automated algorithms should be verified by domain experts, not blindly trusted. The tool provides:

Transparent resolution methods for every country assignment
Flagged uncertain cases that require expert judgment
Auditable spreadsheet format enabling verification and correction
Domain expertise integration where algorithmic confidence is low

While the algorithm achieves high accuracy (99.6% precision), the ~34% of cases flagged for review often represent complex, ambiguous, or critical affiliations that benefit from human verification. Skipping manual curation means accepting potentially incorrect assignments in these cases and missing opportunities to refine your dataset based on research-specific criteria.

⚠️ Recommended Practice: Always perform manual curation, even if brief. At minimum, review all "UNRESOLVED" and "Confusion" cases, and verify that your dataset meets your specific inclusion/exclusion criteria.

Q: What if I need to make changes after converting back to PubMed format?

A: Keep your curated XLSX file. You can make changes there and re-run XLSX2PubMed to generate a new PubMed file.

Q: Can I merge multiple PubMed files?

A: Yes, as long as the merged total doesn't exceed 10,000 records (matching PubMed's download limit).

Process each file through PubMed2XLSX, then merge by copying/pasting rows in Excel. Remove duplicates by PMID, perform manual curation, then convert through XLSX2PubMed. If your total exceeds 10,000 records, keep files separate or split into batches.
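The merge-and-deduplicate step can also be scripted. The sketch below is a minimal illustration, assuming each batch is a list of row dictionaries with a "PMID" column. It keeps the first occurrence of each PMID and refuses to return more than 10,000 records. `merge_by_pmid` is a hypothetical helper, not a PubMedBridge function.

```python
def merge_by_pmid(*batches, max_records=10_000):
    """Merge batches of rows, keeping the first row seen for each PMID."""
    merged = {}
    for batch in batches:
        for row in batch:
            pmid = str(row["PMID"]).strip()
            merged.setdefault(pmid, row)  # Later duplicates are ignored.
    if len(merged) > max_records:
        raise ValueError(
            f"{len(merged)} records exceed the {max_records}-record limit; "
            "keep the files separate or split into batches."
        )
    return list(merged.values())


# Hypothetical batches with one overlapping PMID.
batch_a = [{"PMID": "1", "Title": "A"}, {"PMID": "2", "Title": "B"}]
batch_b = [{"PMID": 2, "Title": "B (duplicate)"}, {"PMID": "3", "Title": "C"}]
merged = merge_by_pmid(batch_a, batch_b)
print([row["Title"] for row in merged])
```

PMIDs are compared as stripped strings so that a value stored as a number in one sheet and as text in another still counts as the same record.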

Q: How do I cite PubMedBridge in my research?

A: Please cite PubMedBridge using the following format:

Citation:

Tha, M., & Khin, N. (2026). PubMedBridge: A Browser-Based Tool for Transparent, Auditable Country-Level Affiliation Resolution in PubMed Bibliometric Research (Version 1.0) [Computer software]. https://pubmedbridge.drmyo.com