1. Overview
PubMedBridge is a suite of web-based tools that streamlines the preprocessing of PubMed data for bibliometric analysis. It provides a user-friendly interface to parse raw PubMed data, convert it into structured formats, and prepare it for visualization and analysis software.
Key Features
- User-Selected Metadata Fields – Choose from 34 metadata options grouped into core citation, author information, affiliations, country, publication content, publication details, and links
- Defensive Resolution – Prioritizes precision over coverage, flagging ambiguous cases rather than forcing uncertain assignments
- Transparent Output – All country resolutions are exported with their resolution method for verification and manual correction
- Spreadsheet-Centric Workflow – Intermediate results exported as .xlsx for easy inspection and correction
- Client-Side Processing – All operations execute locally in your browser, ensuring complete data privacy
- Open Source – Fully transparent codebase available for inspection
2. System Architecture and Workflow
PubMedBridge is architected around two integrated modules that work together in a human-in-the-loop curation workflow, where users verify and, when necessary, manually correct country assignments.
Step 1: Input & Automated Resolution
Upload PubMed .txt file → PubMed2XLSX processes → Export .xlsx spreadsheet
Step 2: Human Validation
Review in Excel/Sheets → Verify countries → Filter & curate data
Step 3: Output Generation
Upload curated .xlsx → XLSX2PubMed converts → Export PubMed .txt
Workflow Details
1. Input & Automated Resolution
Users upload a raw PubMed data file (.txt format) to the PubMed2XLSX tool and select metadata fields to be included in the output. The tool automatically parses the file, applies the country resolution algorithm to affiliation strings, and exports the results as a structured spreadsheet (.xlsx).
Key outputs:
- Structured tabular data with user-selected metadata fields
- Automated country assignments with resolution method labels
- Flagged ambiguous or unresolved cases for manual review
2. Human-in-the-Loop Validation & Curation
The exported spreadsheet serves as an auditable dataset that users can review and refine using standard spreadsheet software. This critical step integrates domain expertise where algorithmic resolution is uncertain.
Two types of curation:
- Country Assignment Verification: Review records and manually correct or confirm country assignments based on contextual knowledge
- Dataset Filtering: Apply filters based on metadata fields to construct tailored datasets:
  - Remove ineligible records based on content of other fields (e.g., Abstract)
  - Exclude publications outside specified date ranges
  - Filter by author criteria or publication types
  - Apply custom inclusion/exclusion criteria
After validation, users can perform preliminary analyses directly on the spreadsheet or proceed to Step 3.
3. Output Generation
The XLSX2PubMed tool converts the curated spreadsheet back into PubMed format (.txt), ensuring compatibility with bibliometric analysis and visualization software such as VOSviewer, CiteSpace, and Bibliometrix.
Benefits:
- Seamless integration with existing bibliometric workflows
- Use of specialized visualization tools that require PubMed format
- Sharing of curated datasets in standardized format
- Closed-loop workflow maintaining original structure
3. Getting Started
3.1 Accessing PubMedBridge
No installation is required. Open PubMedBridge at pubmedbridge.drmyo.com, then:
- Click "Launch the Tool" for either PubMed2XLSX or XLSX2PubMed
- Begin processing your files
3.2 System Requirements
- Modern web browser (Chrome, Firefox, Safari, or Edge recommended)
- JavaScript enabled
- Spreadsheet software for Step 2 curation (Excel, Google Sheets, LibreOffice Calc)
4. Using PubMed2XLSX
4.1 Purpose
PubMed2XLSX parses the metadata in a raw PubMed file, resolves country names from the affiliation strings, and converts the result into structured XLSX and JSON files. It is well suited to data validation, filtering, and performance analysis.
4.2 Step-by-Step Instructions
Step 1: Prepare Your PubMed Data
- Conduct your search on PubMed (pubmed.ncbi.nlm.nih.gov)
- Click "Save" → "Save citations to file"
- Save the file in "PubMed" format with .txt extension
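For readers who want to inspect the exported file before uploading it, the following is a minimal sketch of how PubMed-format text can be read into records. It assumes the usual layout of this format: each field starts with a tag padded to four characters followed by "- ", continuation lines are indented by six spaces, and records are separated by blank lines. The function name `parse_pubmed_text` and the sample data are hypothetical, and this is not PubMedBridge's own parser.

```python
from collections import defaultdict


def parse_pubmed_text(text):
    """Parse PubMed-format text into a list of {tag: [values]} dicts.

    Assumes 'TAG - value' lines, six-space continuation lines, and
    blank lines between records.
    """
    records = []
    current = None
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(dict(current))
            current, last_tag = None, None
            continue
        if line.startswith("      ") and current is not None and last_tag:
            # Continuation line: append to the most recent value.
            current[last_tag][-1] += " " + line.strip()
            continue
        if len(line) > 5 and line[4:6] == "- ":
            tag, value = line[:4].strip(), line[6:].strip()
            if current is None:
                current = defaultdict(list)
            current[tag].append(value)
            last_tag = tag
    if current:
        records.append(dict(current))
    return records


# Hypothetical two-record sample for illustration only.
SAMPLE = """PMID- 12345678
TI  - An example title that wraps
      onto a second line.
AU  - Doe J
AU  - Roe R

PMID- 87654321
TI  - Second record.
"""

records = parse_pubmed_text(SAMPLE)
print(len(records), records[0]["TI"])
```

Repeated tags such as AU are kept as lists, so author order is preserved.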
Step 2: Upload File
- Navigate to PubMed2XLSX tool
- Click "Choose File" or drag and drop your .txt file
- The tool will automatically detect the file format
Step 3: Select Metadata Fields
Choose which metadata fields to include in your output. Fields are organized into categories:
| Category | Fields |
|---|---|
| Core Citation | PMID, Title, Journal, Journal Abbreviation, Publication Year, Volume, Pages |
| Author Information | First Author, Last Author, Co-Authors |
| Country | All Countries, First Author Country, Last Author Country, Co-Author Countries, Non-first Author Countries |
| Affiliations | All Affiliations, First Author Affiliation, Last Author Affiliation, Co-Author Affiliations, Non-first Author Affiliations |
| Publication Content | Abstract, Keywords, MeSH Terms, Major MeSH Terms, Publication Type, Country of Publication, Language |
| Publication Details | ISSN, PMCID, Secondary Source ID, Grant Numbers |
| Links | DOI, PubMed Link |
Step 4: Process the File
- Click "Process"
- Wait for the processing to complete (progress bar will show status)
- Large files may take several minutes
Step 5: Download Results
Two files will be generated:
- .xlsx file – Structured spreadsheet for review and curation
- .json file – Machine-readable format for advanced users
5. Data Curation and Validation
5.1 Opening the Spreadsheet
Open the generated .xlsx file in your preferred spreadsheet software:
- Microsoft Excel
- Google Sheets
- LibreOffice Calc
- Apple Numbers
To format the data as a table (Excel):
- Select your data range (including headers)
- Go to Insert → Table (or press Ctrl+T / Cmd+T)
- Confirm the range and check "My table has headers"
To wrap long text:
- Select the entire table
- Go to Format Cells → Alignment
- Enable "Wrap Text"
- Adjust row heights as needed
5.2 Understanding Country Resolution Methods
Each record includes a "Country Resolution Method" field that indicates how the country was determined:
| Method | Description | Confidence |
|---|---|---|
| Direct Match | Country name identified. | High |
| alpha3 | Country alpha3 code identified. | High |
| US State Name | US state name identified. | High |
| US State Abbreviation | US state abbreviation identified. | High |
| Institution Name | Institution name matched in reference institution database. | Low |
| Institution City | Institution city matched in reference institution database. | Low |
| USGeorgiaToCheck | Cannot disambiguate between Georgia (US State) and Georgia (Country). Country could not be determined. | No |
| Institution Name Confusion | Identical institution name in more than one country. Country could not be determined. | No |
| Institution City Confusion | Identical institution city in more than one country. Country could not be determined. | No |
| UNRESOLVED | Country could not be determined. | No |
| Contribution Note | Not an affiliation string. | - |
| Filtered String | Not an affiliation string. | - |
5.3 Verifying Country Assignments
Priority Order
- UNRESOLVED
- Institution City Confusion and Institution Name Confusion
- USGeorgiaToCheck
- Institution City and Institution Name
- US State Abbreviation and Name
- alpha3
- Direct Match
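If you prefer to order records programmatically rather than with a spreadsheet sort, the priority order above can be sketched as follows. This assumes the spreadsheet rows have already been loaded as dictionaries keyed by column header (for example with a library of your choice) and that the method labels match the table in 5.2 exactly. `REVIEW_RANK` and `sort_for_review` are hypothetical names.

```python
# Lower rank = review first. Labels follow the table in section 5.2.
REVIEW_RANK = {
    "UNRESOLVED": 0,
    "Institution City Confusion": 1,
    "Institution Name Confusion": 1,
    "USGeorgiaToCheck": 2,
    "Institution City": 3,
    "Institution Name": 3,
    "US State Abbreviation": 4,
    "US State Name": 4,
    "alpha3": 5,
    "Direct Match": 6,
}


def sort_for_review(rows, method_column="Country Resolution Method"):
    """Return rows ordered so the least certain resolutions come first.

    Labels missing from REVIEW_RANK (e.g. 'Manual Correction') sort last.
    The sort is stable, so the original order is kept within each rank.
    """
    last = max(REVIEW_RANK.values()) + 1
    return sorted(rows, key=lambda r: REVIEW_RANK.get(r.get(method_column, ""), last))


rows = [
    {"PMID": "1", "Country Resolution Method": "Direct Match"},
    {"PMID": "2", "Country Resolution Method": "UNRESOLVED"},
    {"PMID": "3", "Country Resolution Method": "USGeorgiaToCheck"},
    {"PMID": "4", "Country Resolution Method": "Manual Correction"},
]
ordered = sort_for_review(rows)
print([r["PMID"] for r in ordered])
```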
Manual Correction Process
- Review the affiliation string
- Identify the correct country using contextual information
- Enter the country name in the Country column
- Optionally update the Country Resolution Method to "Manual Correction"
5.4 Dataset Filtering and Refinement
Common Filtering Scenarios
- Date Range: Filter by Publication Year to include only publications within your study period
- Publication Type: Filter by Publication Type to include only original research articles
- Language: Filter by Language if needed for your analysis
- Abstract Content: Search within abstracts to identify relevant studies
- Author Criteria: Filter by author names or affiliations
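These scenarios can also be applied outside the spreadsheet. The sketch below filters rows that have already been loaded as dictionaries keyed by column header. The column names follow the field table in 4.2, but the exact cell values (for example how publication types are written) are assumptions, and `filter_rows` is a hypothetical helper, not part of PubMedBridge.

```python
def to_year(value):
    """Best-effort conversion of a cell value to a four-digit year."""
    try:
        return int(str(value).strip()[:4])
    except ValueError:
        return None


def filter_rows(rows, year_from=None, year_to=None,
                publication_type=None, abstract_keyword=None):
    """Keep rows that satisfy every criterion that was supplied."""
    kept = []
    for row in rows:
        year = to_year(row.get("Publication Year", ""))
        if year_from is not None and (year is None or year < year_from):
            continue
        if year_to is not None and (year is None or year > year_to):
            continue
        if publication_type and publication_type.lower() not in str(
                row.get("Publication Type", "")).lower():
            continue
        if abstract_keyword and abstract_keyword.lower() not in str(
                row.get("Abstract", "")).lower():
            continue
        kept.append(row)
    return kept


# Hypothetical rows for illustration only.
rows = [
    {"PMID": "1", "Publication Year": 2019, "Publication Type": "Journal Article",
     "Abstract": "A cohort study of malaria."},
    {"PMID": "2", "Publication Year": "2022", "Publication Type": "Review",
     "Abstract": "A review of malaria vaccines."},
    {"PMID": "3", "Publication Year": 2023, "Publication Type": "Journal Article",
     "Abstract": "Dengue surveillance."},
]
selected = filter_rows(rows, year_from=2020, abstract_keyword="malaria")
print([r["PMID"] for r in selected])
```

Rows with a missing or unreadable year are dropped whenever a date range is given, which errs on the side of excluding records that cannot be checked.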
6. Using XLSX2PubMed
6.1 Purpose
XLSX2PubMed takes a structured .xlsx file (generated and curated from PubMed2XLSX) and converts it back into standard PubMed format. This enables use with bibliometric analysis and visualization software.
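To illustrate what converting back to PubMed format involves, the sketch below writes one record as tagged lines. It assumes the common layout of a tag padded to four characters, then "- ", with wrapped lines indented by six spaces. The line width of 80 is an arbitrary choice for this example, and `format_field` and `record_to_pubmed` are hypothetical names rather than XLSX2PubMed's actual code.

```python
import textwrap


def format_field(tag, value, width=80):
    """Format one field as 'TAG - value', wrapping with six-space indents."""
    return textwrap.fill(
        str(value),
        width=width,
        initial_indent=f"{tag:<4}- ",
        subsequent_indent=" " * 6,
        break_long_words=False,   # keep DOIs and URLs intact
        break_on_hyphens=False,
    )


def record_to_pubmed(record, width=80):
    """Serialize a {tag: [values]} record; empty values are skipped."""
    lines = []
    for tag, values in record.items():
        for value in values:
            if str(value).strip():
                lines.append(format_field(tag, value, width))
    return "\n".join(lines) + "\n"


# Hypothetical record for illustration only.
example = {"PMID": ["12345678"], "TI": ["Short title."], "AU": ["Doe J", "Roe R"]}
print(record_to_pubmed(example))
```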
6.2 Step-by-Step Instructions
Step 1: Prepare Your Curated File
- Ensure all manual corrections are complete
- Verify column headers remain unchanged
- Save your .xlsx file
Step 2: Upload to XLSX2PubMed
- Navigate to the XLSX2PubMed tool
- Click "Choose File" or drag and drop your .xlsx file
- The tool will validate the file format and convert to PubMed .txt format
- Wait for processing to complete
- Progress indicator will show conversion status
Step 3: Download Outputs
Two files will be generated:
- .txt file – Standard PubMed format ready for analysis software
- .json file – Structured data format for advanced applications
6.3 Using Your Output Files
The PubMed .txt file is ready to use with any bibliometric analysis software that accepts standard PubMed format, including VOSviewer, CiteSpace, Bibliometrix, and others. Refer to your analysis software's documentation for specific import instructions.
7. Troubleshooting
File Upload Fails
Problem: File won't upload or shows error message
Solutions:
- Verify file is in correct format (.txt for PubMed2XLSX, .xlsx for XLSX2PubMed)
- Check file isn't corrupted or empty
- Try a different browser
- Ensure JavaScript is enabled
Processing Takes Too Long
Problem: Tool appears frozen or processing doesn't complete
Solutions:
- Large files (10,000+ records) may take several minutes
- Check browser console for errors (F12 key)
- Try splitting large files into smaller batches
- Ensure sufficient RAM is available
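One way to split a large export into smaller batches is sketched below. It assumes records in the .txt file are separated by blank lines, as in standard PubMed exports. The function names and the batch size are illustrative choices.

```python
import re


def split_records(text):
    """Split PubMed-format text into one string per record."""
    normalized = text.replace("\r\n", "\n").strip()
    if not normalized:
        return []
    return [block for block in re.split(r"\n\s*\n", normalized) if block.strip()]


def make_batches(text, size):
    """Return a list of PubMed-format strings with at most `size` records each."""
    records = split_records(text)
    return [
        "\n\n".join(records[i:i + size]) + "\n"
        for i in range(0, len(records), size)
    ]


# Hypothetical five-record input for illustration only.
sample = "\n\n".join(f"PMID- {n}\nTI  - Title {n}." for n in range(1, 6)) + "\n"
batches = make_batches(sample, size=2)
print(len(batches))
```

Each batch can then be saved as its own .txt file and processed separately.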
Security Check Error with PMID
Problem: Processing fails with error referencing specific PMID
Solutions:
- Locate the PMID mentioned in the error in your .txt file
- Check that record for HTML tags, special characters, or unusual patterns
- Common issues: <script> tags, excessive character repetition, HTML-like syntax
- Edit the problematic text (typically in abstract or title)
- Remove or replace the triggering content
- Re-upload the modified file
Note: Security checks prevent harmful content injection. Technical abstracts (e.g., bioinformatics, computer science) may occasionally trigger false positives and require manual editing.
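To find likely triggers before uploading, a file can be pre-scanned along the lines of the sketch below. The two patterns (HTML-like tags and long runs of one repeated character) are guesses based on the common issues listed above. They are not PubMedBridge's actual security rules, and the threshold of 20 repeated characters is an arbitrary assumption.

```python
import re

# Assumed patterns; the tool's real checks may differ.
HTML_LIKE = re.compile(r"</?[A-Za-z][^<>]*>")
LONG_RUN = re.compile(r"(\S)\1{19,}")  # same non-space character 20+ times
PMID_LINE = re.compile(r"^PMID-\s*(\d+)", re.MULTILINE)


def scan_pubmed_text(text):
    """Return (pmid, reason) pairs for records that look suspicious."""
    findings = []
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        match = PMID_LINE.search(block)
        pmid = match.group(1) if match else "unknown"
        if HTML_LIKE.search(block):
            findings.append((pmid, "HTML-like tag"))
        if LONG_RUN.search(block):
            findings.append((pmid, "long repeated-character run"))
    return findings


# Hypothetical input for illustration only.
sample = (
    "PMID- 111\nAB  - Plain abstract with p < 0.05.\n\n"
    "PMID- 222\nAB  - Contains <script>alert(1)</script> markup.\n"
)
print(scan_pubmed_text(sample))
```

A comparison such as "p < 0.05" is not flagged here because the "<" is followed by a space, but expressions like "a<b and c>d" would be, which mirrors the false positives described in the note above.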
Long Text Truncated in XLSX
Problem: Very long abstracts or affiliation lists appear cut off in Excel
Cause: Excel caps cell content at 32,767 characters. PubMedBridge applies a stricter limit of 25,000 characters per cell for cross-platform compatibility.
Solutions:
- Use the .json output for complete, untruncated text
- Look up the PMID directly on PubMed for full content
- Most abstracts are under this limit; truncation is rare
Note: For text analysis or NLP work, always use the JSON file which contains complete content.
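To see which records may have been affected, cells at or above the limit can be listed with a short check like the one below. It assumes rows are already loaded as dictionaries keyed by column header and that a truncated cell is exactly as long as the limit or longer. How PubMedBridge marks truncation is not specified here, so treat the result as a list of candidates to compare against the JSON output.

```python
CELL_LIMIT = 25_000  # per-cell limit stated above


def find_long_cells(rows, limit=CELL_LIMIT):
    """Return (PMID, column) pairs for text cells at or above the limit."""
    hits = []
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and len(value) >= limit:
                hits.append((row.get("PMID"), column))
    return hits


# Hypothetical rows for illustration only.
rows = [
    {"PMID": "1", "Abstract": "short"},
    {"PMID": "2", "Abstract": "x" * CELL_LIMIT},
]
print(find_long_cells(rows))
```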
Column Headers Modified Error
Problem: XLSX2PubMed rejects your file
Solutions:
- Ensure you haven't renamed any column headers
- Check for extra spaces in header names
- Verify you're using the original file from PubMed2XLSX
- If headers were modified, regenerate from original PubMed file
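A quick way to locate a modified header is to compare the curated file's header row with the header row of the original PubMed2XLSX export. The sketch below assumes both header rows are available as lists of strings. `check_headers` is a hypothetical helper and the example headers are taken from the field table in 4.2.

```python
def check_headers(expected, actual):
    """Return human-readable problems found in the `actual` header row."""
    problems = []
    stripped = {h.strip() for h in actual}
    for header in actual:
        if header != header.strip():
            problems.append(f"extra whitespace in header {header!r}")
    for header in expected:
        # A header that only differs by whitespace is already reported above.
        if header not in actual and header not in stripped:
            problems.append(f"missing header {header!r}")
    for header in actual:
        if header.strip() not in expected:
            problems.append(f"unexpected header {header!r}")
    return problems


original = ["PMID", "Title", "Publication Year"]
curated = ["PMID", "Title ", "Year"]
for problem in check_headers(original, curated):
    print(problem)
```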
Downloaded File Won't Open
Problem: XLSX file appears corrupted or won't open
Solutions:
- Ensure download completed fully
- Try opening with different spreadsheet software
- Clear browser cache and regenerate file
- Check file size isn't zero bytes
8. Frequently Asked Questions
General Questions
Q: Is my data secure when using PubMedBridge?
A: Yes. All processing occurs entirely within your web browser. No data is uploaded to external servers. Your files remain on your computer throughout the entire workflow.
Q: Do I need to install any software?
A: No installation is required for PubMedBridge itself. You only need a modern web browser. However, you will need spreadsheet software (Excel, Google Sheets, etc.) for the curation step.
Q: What file size limits exist?
A: PubMedBridge has a 100MB file size limit to prevent browser freezing and ensure stable processing. In practice, this is rarely an issue—PubMed's maximum export of 10,000 records per file typically results in files well under this limit. The file size depends on abstract length and metadata completeness.
Q: Can I use this with databases other than PubMed?
A: Currently, PubMedBridge is optimized for PubMed format only. Support for other databases may be added in future versions.
Technical Questions
Q: Why are some affiliations marked as "UNRESOLVED"?
A: The algorithm prioritizes accuracy over completeness. If the country cannot be determined with reasonable confidence, it's flagged for manual review rather than guessed.
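The idea of flagging rather than guessing can be illustrated with a deliberately tiny sketch. This is not the PubMedBridge algorithm: the lookup tables hold only a few hypothetical entries, the institution-based steps are omitted, and the real logic lives in the algorithm repository. The method labels are taken from the table in 5.2.

```python
# Toy lookup tables with a handful of entries, for illustration only.
COUNTRIES = {"france": "France", "japan": "Japan", "united states": "United States"}
ALPHA3 = {"USA": "United States", "FRA": "France", "JPN": "Japan"}
US_STATES = {"texas", "ohio", "california"}


def resolve_country(affiliation):
    """Return (country, method); country is None when the case is flagged."""
    tokens = [t.strip().rstrip(".") for t in affiliation.split(",")]
    # Scan from the end, where the country usually appears.
    for token in reversed(tokens):
        key = token.lower()
        if key in COUNTRIES:
            return COUNTRIES[key], "Direct Match"
        if token in ALPHA3:
            return ALPHA3[token], "alpha3"
        if key == "georgia":
            # Ambiguous between the US state and the country: flag it.
            return None, "USGeorgiaToCheck"
        if key in US_STATES:
            return "United States", "US State Name"
    return None, "UNRESOLVED"


print(resolve_country("Example University, Paris, France."))
print(resolve_country("Example University, Atlanta, Georgia"))
print(resolve_country("Unknown Institute"))
```

In this toy version an explicit country cue later in the string (for example "Georgia, USA") is matched first, so only a bare "Georgia" is flagged for manual review.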
Q: Can I add custom country resolution rules?
A: The current version doesn't support custom rules through the interface, but the open-source code can be modified by advanced users.
Q: Where can I inspect the country resolution algorithm code?
A: The country resolution algorithm is open-source and available at github.com/drmyo/pmbalgorithm. This repository provides an interactive, browser-based showcase of the exact country-resolution logic used in PubMedBridge.
What's included:
- Interactive interface for testing the algorithm with custom affiliation strings
- Complete source code showing the hierarchical, rule-based approach
- Validation documentation (VALIDATION.md) with study design and accuracy results
- Development guide (DEVELOPMENT.md) for contributors and maintainers
Note: This repository is specifically designed for validation, inspection, and methodological transparency. It is not the full PubMedBridge system and is not intended for production use - for actual data processing, use the main PubMedBridge tool at pubmedbridge.drmyo.com.
Q: What database does PubMedBridge use to match institution names and cities?
A: PubMedBridge uses a unified reference dataset combining two major sources:
- ROR (Research Organization Registry) v1.74: A community-led registry of 120,196 research organizations with curated institution names, aliases, cities, and countries
- OpenAlex: An open catalog of scholarly entities, contributing 115,781 institutions (107,709 with country data) for broader coverage
These datasets are merged and deduplicated to create a comprehensive reference of 120,428 unique institutions worldwide. When the algorithm encounters an institution name or city in an affiliation string, it searches this database to identify the corresponding country.
Why combine both sources? ROR provides high-quality curated data, while OpenAlex offers broader coverage. Together, they maximize the algorithm's ability to correctly resolve affiliations from diverse institutions globally.
Q: How accurate is the automated country resolution?
A: Precision is very high, while recall is deliberately conservative. The algorithm was validated using a stratified random sample of 430 affiliation strings from 9,931 PubMed articles (108,557 total affiliation strings).
Validation results:
- High-confidence methods (direct country matches, alpha-3 codes, US state names and abbreviations, covering 96.2% of resolved affiliations): Manual verification of 100 randomly sampled records from each category confirmed all assignments were correct
- Overall algorithm performance:
  - 99.6% precision: Almost all country assignments made are correct
  - 98.7% specificity: Correctly identifies truly ambiguous cases
  - 66.4% recall: Successfully resolves about two-thirds of affiliations; the remaining third are flagged for manual review
  - F1-score of 0.780: Reflects the balance between high precision and conservative recall
The algorithm uses a conservative, defensive approach—it flags uncertain cases for manual review rather than risk incorrect assignments. This design prioritizes data quality and ensures users can verify all country resolutions. Complete validation methodology, including stratified sampling strategy and detailed results, is available in the algorithm repository.
Workflow Questions
Q: Can I skip the manual curation step?
A: Technically yes, but this defeats the primary purpose of PubMedBridge. Human-in-the-loop validation with auditable output is the essence of the tool.
PubMedBridge is specifically designed around the principle that automated algorithms should be verified by domain experts, not blindly trusted. The tool provides:
- Transparent resolution methods for every country assignment
- Flagged uncertain cases that require expert judgment
- Auditable spreadsheet format enabling verification and correction
- Domain expertise integration where algorithmic confidence is low
While the algorithm achieves high accuracy (99.6% precision), the ~34% of cases flagged for review often represent complex, ambiguous, or critical affiliations that benefit from human verification. Skipping manual curation means accepting potentially incorrect assignments in these cases and missing opportunities to refine your dataset based on research-specific criteria.
Q: What if I need to make changes after converting back to PubMed format?
A: Keep your curated XLSX file. You can make changes there and re-run XLSX2PubMed to generate a new PubMed file.
Q: Can I merge multiple PubMed files?
A: Yes, as long as the merged total doesn't exceed 10,000 records (matching PubMed's download limit).
Process each file through PubMed2XLSX, then merge by copying/pasting rows in Excel. Remove duplicates by PMID, perform manual curation, then convert through XLSX2PubMed. If your total exceeds 10,000 records, keep files separate or split into batches.
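The merge and de-duplication can also be scripted. The sketch below assumes each spreadsheet has been loaded as a list of dictionaries keyed by column header, keeps the first occurrence of each PMID, and raises an error if the merged total exceeds 10,000 records. `merge_rows` and `normalize_pmid` are hypothetical helpers.

```python
def normalize_pmid(value):
    """Return the PMID as a clean string (spreadsheets may load it as a float)."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip() if value is not None else ""


def merge_rows(*tables, limit=10_000):
    """Merge row lists, dropping rows whose PMID was already seen."""
    seen = set()
    merged = []
    for table in tables:
        for row in table:
            pmid = normalize_pmid(row.get("PMID"))
            if pmid and pmid in seen:
                continue  # duplicate record
            seen.add(pmid)
            merged.append(row)
    if len(merged) > limit:
        raise ValueError(
            f"{len(merged)} records exceed the limit of {limit}; split into batches"
        )
    return merged


# Hypothetical rows for illustration only.
file_a = [{"PMID": "1", "Title": "A"}, {"PMID": "2", "Title": "B"}]
file_b = [{"PMID": 2.0, "Title": "B"}, {"PMID": "3", "Title": "C"}]
merged = merge_rows(file_a, file_b)
print([normalize_pmid(r["PMID"]) for r in merged])
```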
Q: How do I cite PubMedBridge in my research?
A: Please cite PubMedBridge using the following format:
Citation:
Tha, M., & Khin, N. (2026). PubMedBridge: A Preprocessor for Auditable Country-Level Affiliation Resolution in Bibliometric Research (Version 1.0) [Computer software]. https://pubmedbridge.drmyo.com
For the country resolution algorithm specifically:
Tha, M., & Khin, N. (2026). PubMedBridge Country Resolution Algorithm [Computer software]. https://doi.org/10.5281/zenodo.18212014
BibTeX:
@software{tha2026pubmedbridge,
  author  = {Tha, Myo and Khin, Nilar},
  title   = {PubMedBridge: A Preprocessor for Auditable Country-Level Affiliation Resolution in Bibliometric Research},
  year    = {2026},
  version = {1.0},
  url     = {https://pubmedbridge.drmyo.com},
  note    = {Web application}
}
If discussing the methodology or validation of country resolution specifically, also cite the open-source algorithm repository.