Author: Umair Abbas Siddiquie
Email: [email protected]
GitHub: @trizist
- Overview
- Core Concepts
- Data Classes Deep Dive
- Working with Classifications
- Query System
- Date Handling
- Practical Examples
- Best Practices
- Integration Patterns
- Troubleshooting
This module provides the foundational data structures for an arXiv search service. It's designed to handle academic paper metadata, search queries, and result formatting with type safety and performance in mind.
from typing import Any, Optional, List, Dict
from datetime import datetime, date
from pytz import timezone
from mypy_extensions import TypedDict
from dataclasses import dataclass, field, asdict
from arxiv import taxonomy

EASTERN = timezone('US/Eastern') # arXiv's local timezone, used by DateRange defaults below

Every arXiv paper has:
- Unique identifier (e.g., "2301.12345")
- Metadata (title, authors, abstract, classifications)
- Version history (papers can be updated)
- Full text content (PDF/source files)
The system supports:
- Simple field-based searches
- Complex classification filtering
- Date range queries
- Paginated results
@dataclass
class DocMeta:
# Core identification
paper_id: str = "" # e.g., "2301.12345"
title: str = "" # Paper title
title_utf8: str = "" # UTF-8 encoded title
# Author information
authors: str = "" # Author names as string
authors_utf8: str = "" # UTF-8 encoded authors
authors_parsed: List[Dict] = field(default_factory=list) # Structured author data (default_factory avoids shared mutable defaults)
author_owners: List[Dict] = field(default_factory=list) # Author ownership info
# Content
abstract: str = "" # Paper abstract
abstract_utf8: str = "" # UTF-8 encoded abstract
# Dates (all as strings in ISO format)
submitted_date: str = "" # Current version submission
submitted_date_all: List[str] = field(default_factory=list) # All submission dates
modified_date: str = "" # Last modification
updated_date: str = "" # Last update
announced_date_first: str = "" # First announcement
# Status flags
is_current: bool = True # Is this the current version?
is_withdrawn: bool = False # Has paper been withdrawn?
# Classifications
primary_classification: Dict[str, str] = field(default_factory=dict)
secondary_classification: List[Dict[str, str]] = field(default_factory=list)
abs_categories: str = "" # Category string
# Publication info
journal_ref: str = "" # Journal reference
journal_ref_utf8: str = "" # UTF-8 encoded journal ref
doi: str = "" # Digital Object Identifier
comments: str = "" # Author comments
comments_utf8: str = "" # UTF-8 encoded comments
# Technical metadata
version: int = 1 # Paper version number
latest_version: int = 1 # Latest available version
latest: str = "" # Latest version identifier
formats: List[str] = field(default_factory=list) # Available formats (pdf, ps, etc.)
source: Dict[str, Any] = field(default_factory=dict) # Source file information
# Administrative
metadata_id: int = -1 # Internal metadata ID
document_id: int = -1 # Internal document ID
submitter: Dict[str, str] = field(default_factory=dict) # Submitter information
license: Dict[str, str] = field(default_factory=dict) # License information
proxy: bool = False # Is this a proxy submission?
# External classifications
msc_class: str = "" # Mathematics Subject Classification
acm_class: str = "" # ACM Computing Classification
report_num: str = "" # Report number

@dataclass
class Fulltext:
content: str # The extracted full text
version: str # Version identifier
created: datetime # When the extraction was performed

Use cases:
- Full-text search capabilities
- Content analysis and processing
- Version tracking for text changes
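As an illustration of the first use case, here is a minimal sketch that builds a Fulltext record and does a naive in-memory match; a production service would use a proper search index, and the timezone handling simply mirrors the examples later in this guide.

from datetime import datetime
from pytz import timezone

EASTERN = timezone('US/Eastern')

ft = Fulltext(
    content="We propose a transformer-based approach to climate modeling ...",
    version="1",
    created=datetime(2023, 3, 15, tzinfo=EASTERN),
)

def contains_term(fulltext: Fulltext, term: str) -> bool:
    # Naive case-insensitive containment check over the extracted text.
    return term.lower() in fulltext.content.lower()

contains_term(ft, "climate")  # True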
arXiv uses a three-level classification system:
Group → Archive → Category
e.g., cs (Computer Science) → cs.AI (Artificial Intelligence) → cs.AI (AI papers)
# Individual classification component
class ClassificationPart(TypedDict):
id: str # e.g., "cs", "cs.AI"
name: str # e.g., "Computer Science", "Artificial Intelligence"
# Complete classification
class Classification(TypedDict):
group: Optional[ClassificationPart] # Top level (cs, math, physics)
archive: Optional[ClassificationPart] # Mid level (cs.AI, math.NT)
category: Optional[ClassificationPart] # Specific category

# Example classification structure
classification = {
'group': {'id': 'cs', 'name': 'Computer Science'},
'archive': {'id': 'cs.AI', 'name': 'Artificial Intelligence'},
'category': {'id': 'cs.AI', 'name': 'Artificial Intelligence'}
}
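# Note: ClassificationList is used throughout this guide but is not defined in
# the excerpt above. A minimal sketch (an assumption, not the module's actual
# implementation): a list of Classification dicts with a readable string form.
class ClassificationList(list):
    def __str__(self) -> str:
        parts = []
        for cls in self:
            names = [cls[level]['name'] for level in ('group', 'category') if cls.get(level)]
            parts.append(' - '.join(names))
        return ', '.join(parts)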
# ClassificationList provides string representation
classifications = ClassificationList([classification])
print(str(classifications)) # "Computer Science - Artificial Intelligence"

@dataclass
class Query:
# Pagination
size: int = 50 # Results per page (max 2000)
page_start: int = 0 # Starting offset
# Display options
order: Optional[str] = None # Sort order
include_older_versions: bool = False
hide_abstracts: bool = False
# Computed properties
@property
def page_end(self) -> int:
return self.page_start + self.size
@property
def page(self) -> int:
return 1 + int(round(self.page_start / self.size))
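A quick illustration of the computed pagination properties, using the Query class defined above:

q = Query(size=50, page_start=100)
q.page_end  # 150 -> exclusive end offset of this page (results 100-149)
q.page      # 3   -> this is the third page of 50 results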
@dataclass
class SimpleQuery(Query):
search_field: str = "" # Field to search in
value: str = "" # Search term
classification: ClassificationList = field(default_factory=ClassificationList)
include_cross_list: bool = True # Include secondary classifications

The system supports these searchable fields:
| Field | Description | Example |
|---|---|---|
| `all` | All fields | General search |
| `title` | Paper title | "neural networks" |
| `author` | Author names | "Smith, John" |
| `abstract` | Abstract text | "machine learning" |
| `comments` | Author comments | "preliminary results" |
| `journal_ref` | Journal reference | "Nature 2023" |
| `paper_id` | arXiv identifier | "2301.12345" |
| `doi` | DOI | "10.1038/s41586-023-12345-6" |
| `orcid` | Author ORCID | "0000-0000-0000-0000" |
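The query best practices and the validate_query() helper later in this guide refer to Query.SUPPORTED_FIELDS and Query.MAXIMUM_size, which are not shown in the Query definition above. A plausible declaration, assumed here for the later examples and consistent with this table and the 2000-result limit:

# Hypothetical class-level constants on Query, assumed by later examples.
MAXIMUM_size = 2000  # hard cap on results per request
SUPPORTED_FIELDS = [
    ('all', 'All fields'),
    ('title', 'Title'),
    ('author', 'Author(s)'),
    ('abstract', 'Abstract'),
    ('comments', 'Comments'),
    ('journal_ref', 'Journal reference'),
    ('paper_id', 'arXiv identifier'),
    ('doi', 'DOI'),
    ('orcid', 'ORCID'),
]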
@dataclass
class DateRange:
start_date: datetime = datetime(1990, 1, 1, tzinfo=EASTERN)
end_date: datetime = field(default_factory=lambda: datetime.now(tz=EASTERN)) # evaluated per instance, not once at import time
date_type: str = 'submitted_date' # What date to filter on
# Date type constants
SUBMITTED_ORIGINAL = 'submitted_date_first' # Original submission
SUBMITTED_CURRENT = 'submitted_date' # Current version
ANNOUNCED = 'announced_date_first' # Announcement date

- submitted_date_first: When the paper was originally submitted to arXiv
- submitted_date: When the current version was submitted
- announced_date_first: When the paper was first announced/published
from datetime import datetime
from pytz import timezone
EASTERN = timezone('US/Eastern')
# Create a DocMeta instance for a new paper
paper = DocMeta(
paper_id="2301.12345",
title="Deep Learning for Climate Modeling",
authors="Smith, J.; Johnson, A.; Williams, B.",
abstract="This paper presents a novel approach to climate modeling using deep learning techniques...",
submitted_date="2023-01-15",
primary_classification={
'id': 'cs.LG',
'name': 'Machine Learning'
},
version=1,
is_current=True
)

# Simple title search
title_query = SimpleQuery(
search_field="title",
value="neural networks",
size=25,
page_start=0
)
# Author search with classification filter
author_query = SimpleQuery(
search_field="author",
value="Hinton",
classification=ClassificationList([{
'group': {'id': 'cs', 'name': 'Computer Science'},
'archive': {'id': 'cs.LG', 'name': 'Machine Learning'},
'category': {'id': 'cs.LG', 'name': 'Machine Learning'}
}]),
size=50
)

from datetime import datetime
# Papers from the last year
recent_papers = DateRange(
start_date=datetime(2023, 1, 1, tzinfo=EASTERN),
end_date=datetime.now(tz=EASTERN),
date_type=DateRange.SUBMITTED_CURRENT
)
# Original submissions from 2022
original_2022 = DateRange(
start_date=datetime(2022, 1, 1, tzinfo=EASTERN),
end_date=datetime(2022, 12, 31, tzinfo=EASTERN),
date_type=DateRange.SUBMITTED_ORIGINAL
)

# Convert DocMeta to dictionary for JSON serialization
paper_dict = asdict(paper)
# Access nested data
primary_cat = paper.primary_classification.get('name', 'Unknown')
first_author = paper.authors_parsed[0]['name'] if paper.authors_parsed else 'Unknown'
# Check paper status
if paper.is_withdrawn:
print("This paper has been withdrawn")
elif not paper.is_current:
print(f"This is version {paper.version}, latest is {paper.latest_version}")- Use TypedDict for classifications (performance optimization)
- The asdict() function efficiently converts dataclasses to dictionaries
- Always use timezone-aware datetime objects
- Default to US/Eastern timezone (arXiv's timezone)
- Use appropriate date_type for your use case
- Validate field names against SUPPORTED_FIELDS
- Respect the MAXIMUM_size limit (2000 results)
- Use pagination for large result sets
- Use include_cross_list=True to include papers with secondary classifications
- Build classification filters incrementally (group → archive → category), as sketched below
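For the last point, a minimal sketch (using the Classification and ClassificationList types above) of building a filter from the most general level down; the display names are left as the raw identifiers here, whereas a real service would look them up in the arXiv taxonomy:

def build_classification(group_id: str,
                         archive_id: Optional[str] = None,
                         category_id: Optional[str] = None) -> Classification:
    """Build a classification filter incrementally: group -> archive -> category."""
    cls: Classification = {'group': {'id': group_id, 'name': group_id},
                           'archive': None, 'category': None}
    if archive_id:
        cls['archive'] = {'id': archive_id, 'name': archive_id}
    if category_id:
        cls['category'] = {'id': category_id, 'name': category_id}
    return cls

filters = ClassificationList([build_classification('cs', 'cs.LG', 'cs.LG')])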
# Validate query size
if query.size > Query.MAXIMUM_size:
query.size = Query.MAXIMUM_size
# Handle missing metadata gracefully
title = paper.title or paper.title_utf8 or "Untitled"
authors = paper.authors or "Unknown authors"

class ArxivSearchService:
def __init__(self):
self.eastern = timezone('US/Eastern')
def search(self, query: SimpleQuery) -> List[DocMeta]:
# Implementation would connect to arXiv API/database
pass
def get_paper(self, paper_id: str) -> Optional[DocMeta]:
# Retrieve single paper by ID
pass
def get_fulltext(self, paper_id: str, version: Optional[int] = None) -> Optional[Fulltext]:
# Retrieve full text content
pass
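A sketch of how a caller might drive this service once the stubs are implemented; the `or []` guard is only needed because the methods above are placeholders that currently return None.

service = ArxivSearchService()
query = SimpleQuery(search_field="title", value="climate modeling", size=10)
for paper in service.search(query) or []:
    print(paper.paper_id, paper.title)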
"""Process and aggregate search results."""
processed = {
'total_papers': len(results),
'by_category': {},
'by_year': {},
'authors': set()
}
for paper in results:
# Aggregate by category
cat_id = paper.primary_classification.get('id', 'unknown')
processed['by_category'][cat_id] = processed['by_category'].get(cat_id, 0) + 1
# Aggregate by year
if paper.submitted_date:
year = paper.submitted_date[:4]
processed['by_year'][year] = processed['by_year'].get(year, 0) + 1
# Collect unique authors
for author_dict in paper.authors_parsed:
if 'name' in author_dict:
processed['authors'].add(author_dict['name'])
processed['authors'] = list(processed['authors'])
return processed

from functools import lru_cache
from typing import Tuple
class CachedArxivService:
@lru_cache(maxsize=1000)
def _cached_search(self, field: str, value: str,
classifications: Tuple, size: int,
start: int) -> List[DocMeta]:
"""Cached search with tuple-based key for immutability."""
query = SimpleQuery(
search_field=field,
value=value,
classification=ClassificationList(classifications),
size=size,
page_start=start
)
return self._execute_search(query)

Problem: Non-ASCII characters in titles/abstracts
# Solution: Use UTF-8 fields when available
title = paper.title_utf8 or paper.title
abstract = paper.abstract_utf8 or paper.abstract

Problem: String dates need conversion to datetime objects
from datetime import datetime
def parse_arxiv_date(date_str: str) -> Optional[datetime]:
"""Parse arXiv date string to datetime object."""
if not date_str:
return None
try:
# arXiv dates are typically in format: "2023-03-15"
# pytz timezones must be attached with localize(); replace(tzinfo=...) would give a wrong UTC offset
return EASTERN.localize(datetime.strptime(date_str, "%Y-%m-%d"))
except ValueError:
# Handle alternative formats
try:
return EASTERN.localize(datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S"))
except ValueError:
return None

Problem: Inconsistent classification data structure
def normalize_classification(cls_dict: Dict) -> Classification:
"""Normalize classification dictionary."""
normalized = Classification(
group=None,
archive=None,
category=None
)
if 'id' in cls_dict:
parts = cls_dict['id'].split('.')
if len(parts) >= 1:
normalized['group'] = {
'id': parts[0],
'name': cls_dict.get('name', parts[0])
}
if len(parts) >= 2:
archive_id = '.'.join(parts[:2])
normalized['archive'] = {
'id': archive_id,
'name': cls_dict.get('name', archive_id)
}
normalized['category'] = {
'id': cls_dict['id'],
'name': cls_dict.get('name', cls_dict['id'])
}
return normalized

Problem: Large DocMeta objects consuming too much memory
def create_lightweight_result(paper: DocMeta) -> Dict[str, Any]:
"""Create lightweight version of paper metadata."""
return {
'paper_id': paper.paper_id,
'title': paper.title or paper.title_utf8,
'authors': paper.authors,
'submitted_date': paper.submitted_date,
'primary_category': paper.primary_classification.get('id'),
'abstract_preview': (paper.abstract or paper.abstract_utf8)[:200] + "..."
}

- Use TypedDict for frequently accessed data: The classification system uses TypedDict for better performance
- Implement result caching: Cache frequently accessed papers and search results
- Lazy loading: Load full text content only when needed (see the sketch after this list)
- Batch operations: Process multiple papers together when possible
- Index optimization: If building a database, index on commonly searched fields
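For the lazy-loading tip above, a minimal sketch; fetch_fulltext is a hypothetical helper standing in for whatever retrieval mechanism the service actually uses:

class PaperRecord:
    """Wraps a DocMeta and defers loading the full text until it is accessed."""

    def __init__(self, meta: DocMeta):
        self.meta = meta
        self._fulltext: Optional[Fulltext] = None

    @property
    def fulltext(self) -> Optional[Fulltext]:
        # Load on first access only, then reuse the cached value.
        if self._fulltext is None:
            self._fulltext = fetch_fulltext(self.meta.paper_id)  # hypothetical helper
        return self._fulltext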
def debug_paper(paper: DocMeta) -> None:
"""Print detailed paper information for debugging."""
print(f"Paper ID: {paper.paper_id}")
print(f"Title: {paper.title}")
print(f"Authors: {paper.authors}")
print(f"Submitted: {paper.submitted_date}")
print(f"Primary Category: {paper.primary_classification}")
print(f"Version: {paper.version}/{paper.latest_version}")
print(f"Current: {paper.is_current}, Withdrawn: {paper.is_withdrawn}")
print(f"Abstract length: {len(paper.abstract)} chars")
print("-" * 50)
def validate_query(query: SimpleQuery) -> List[str]:
"""Validate query parameters and return list of issues."""
issues = []
if query.size > Query.MAXIMUM_size:
issues.append(f"Size {query.size} exceeds maximum {Query.MAXIMUM_size}")
if query.search_field not in [field[0] for field in Query.SUPPORTED_FIELDS]:
issues.append(f"Unsupported search field: {query.search_field}")
if not query.value.strip():
issues.append("Empty search value")
return issues

This comprehensive guide covers all aspects of the arXiv search service base classes. The code is designed for scalability, type safety, and performance, making it suitable for production use in academic search applications.