zhinengresearch Improvement Plan

Based on: 8-Dimension Code Review Report
Date: 2026-03-23
Current Score: 3.1/5.0
Target Score: 4.5/5.0


Priority Matrix

| Priority | Task | Impact | Effort | ROI |
|----------|------|--------|--------|-----|
| P0 | Add exception handling | High | 2h | Very High |
| P0 | Add docstrings | High | 4h | High |
| P1 | Split large files | Medium | 4h | Medium |
| P1 | Add configuration management | Medium | 2h | Medium |
| P2 | Add unit tests | High | 8h | Medium |
| P2 | Implement mixed precision | Medium | 2h | Low |
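
The P1 configuration-management task is not detailed in the phases below; a common approach is to collect the module-level constants referenced throughout train.py and prepare.py (BATCH_SIZE, SEQ_LENGTH, GRADIENT_CLIP, ...) into a single dataclass. A minimal sketch, with placeholder defaults rather than the project's actual values:

from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Single source of truth for training constants (placeholder defaults)."""
    batch_size: int = 32
    eval_batch_size: int = 64
    seq_length: int = 256
    gradient_clip: float = 1.0
    train_time_budget: float = 3600.0  # seconds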


Phase 1: Quick Wins (6 hours)

1. Add Exception Handling (2 hours)

Files: train.py, prepare.py

Changes:

# train.py:267-271
import sys        # used by the sys.exit() calls below
import traceback  # used by the catch-all handler below

def main():
    """Main training entry point."""
    try:
        print('='*60)
        print('Intelligent Research Framework - Training')
        print('='*60)

        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f'\nUsing device: {device}')

        print('\nLoading data...')
        train_loader, val_loader = get_dataloaders(batch_size=BATCH_SIZE)
        print(f'Training batches: {len(train_loader)}')
        print(f'Validation batches: {len(val_loader)}')

        # ... rest of code

    except FileNotFoundError as e:
        print(f'❌ File not found: {e}')
        print('Run python prepare.py first')
        sys.exit(1)
    except RuntimeError as e:
        print(f'❌ Runtime error: {e}')
        print('Check the CUDA setup or fall back to CPU')
        sys.exit(1)
    except KeyboardInterrupt:
        print('\n\n⚠️  Training interrupted by user')
        sys.exit(0)
    except Exception as e:
        print(f'❌ Unexpected error: {e}')
        traceback.print_exc()
        sys.exit(1)

# prepare.py:158-164
def get_dataloaders(batch_size: int = BATCH_SIZE) -> Tuple[DataLoader, DataLoader]:
    """Build the training and validation data loaders.

    Args:
        batch_size: Training batch size

    Returns:
        Tuple of (train_loader, val_loader)

    Raises:
        FileNotFoundError: If data shards are not found
    """
    try:
        shard_files = sorted(DATA_SHARDS_DIR.glob('train_shard_*.npy'))

        if not shard_files:
            raise FileNotFoundError(
                f'No data shards found in {DATA_SHARDS_DIR}. '
                f'Run prepare.py first.'
            )

        train_data = np.concatenate([
            np.load(f) for f in tqdm(shard_files, desc='Loading shards')
        ])

        val_data = np.load(DATA_SHARDS_DIR / 'val.npy')

        train_dataset = TextDataset(train_data, SEQ_LENGTH)
        val_dataset = TextDataset(val_data, SEQ_LENGTH)

        train_loader = DataLoader(
            train_dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=2,
            pin_memory=True
        )
        val_loader = DataLoader(
            val_dataset,
            batch_size=EVAL_BATCH_SIZE,
            shuffle=False,
            num_workers=2,
            pin_memory=True
        )

        return train_loader, val_loader

    except Exception as e:
        print(f'❌ Error while loading data: {e}')
        raise

Success Criteria:

- All file operations wrapped in try-except
- User-friendly error messages
- Graceful exit on errors


2. Add Docstrings (4 hours)

Files: All classes and methods

Template:

class CausalSelfAttention(nn.Module):
    """Causal self-attention mechanism for Transformer.

    This implements multi-head self-attention with causal masking
    to prevent attending to future positions. Uses efficient
    QKV projection (single matrix multiply).

    Attributes:
        d_model: Model dimension
        n_heads: Number of attention heads
        d_k: Dimension per attention head
    """

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1) -> None:
        """Initialize causal self-attention layer.

        Args:
            d_model: Model dimension (must be divisible by n_heads)
            n_heads: Number of attention heads
            dropout: Dropout probability

        Raises:
            AssertionError: If d_model is not divisible by n_heads
        """
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"

        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.output_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Forward pass through causal self-attention.

        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)
            mask: Optional attention mask (not used, causal masking is automatic)

        Returns:
            Output tensor of shape (batch_size, seq_len, d_model)
        """
        batch_size, seq_len, _ = x.shape

        qkv = self.qkv_proj(x)
        qkv = qkv.view(batch_size, seq_len, 3, self.n_heads, self.d_k)
        qkv = qkv.permute(2, 0, 3, 1, 4)

        q, k, v = qkv[0], qkv[1], qkv[2]

        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.d_k ** 0.5)

        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, device=x.device), diagonal=1
        ).bool()
        scores = scores.masked_fill(causal_mask, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        attn_output = torch.matmul(attn_weights, v)
        attn_output = attn_output.permute(0, 2, 1, 3)
        attn_output = attn_output.reshape(batch_size, seq_len, self.d_model)

        output = self.output_proj(attn_output)

        return output

Success Criteria:

- All classes have docstrings
- All public methods have docstrings
- Docstrings follow Google style (Args, Returns, Raises)
- 100% docstring coverage
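
To verify the 100% coverage criterion, dedicated tools such as interrogate exist; as a dependency-free alternative, a small ast-based check can live in the repo. A rough sketch (the script itself is illustrative, not part of the codebase):

"""Rough docstring-coverage check; tools like interrogate are more thorough."""

import ast
import pathlib


def docstring_coverage(root: str = '.') -> float:
    """Return the fraction of classes/functions under root with a docstring."""
    total = documented = 0
    for path in pathlib.Path(root).rglob('*.py'):
        tree = ast.parse(path.read_text(encoding='utf-8'))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                total += 1
                documented += ast.get_docstring(node) is not None
    return documented / total if total else 1.0


if __name__ == '__main__':
    print(f'Docstring coverage: {docstring_coverage():.0%}')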


Phase 2: Refactoring (4 hours)

3. Split Large Files (4 hours)

Target Structure:

zhinengresearch/
├── prepare.py              # Main entry point (simplified)
├── train.py                # Main entry point (simplified)
├── model/
│   ├── __init__.py
│   ├── attention.py         # CausalSelfAttention
│   ├── blocks.py           # TransformerBlock, FeedForward
│   └── language_model.py    # LanguageModel
├── data/
│   ├── __init__.py
│   ├── tokenizer.py         # train_bpe_tokenizer, get_tokenizer
│   ├── dataset.py          # TextDataset
│   └── dataloader.py      # get_dataloaders, create_data_shards
└── utils/
    ├── __init__.py
    └── evaluation.py       # evaluate_bpb

File: model/__init__.py

"""Model components for zhinengresearch."""

from .attention import CausalSelfAttention
from .blocks import FeedForward, TransformerBlock
from .language_model import LanguageModel

__all__ = [
    'CausalSelfAttention',
    'FeedForward',
    'TransformerBlock',
    'LanguageModel'
]

File: model/attention.py

"""Causal self-attention implementation."""

import torch
import torch.nn as nn
from typing import Optional

class CausalSelfAttention(nn.Module):
    # ... (full implementation with docstrings)

Success Criteria:

- No file > 200 lines
- Clear separation of concerns
- Easy to test individual components
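
To keep the split backward compatible with existing imports (the low-risk assumption in Risk Assessment below), the old top-level modules can re-export the moved names during the transition. A sketch, assuming the classes were previously importable from the top-level files (adjust to the actual original locations):

# Transition shim (hypothetical) -- keeps old top-level imports working
# after the split; delete once all call sites use the new packages.
from model import CausalSelfAttention, FeedForward, TransformerBlock, LanguageModel
from data.dataset import TextDataset
from data.dataloader import get_dataloaders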


Phase 3: Testing (8 hours)

4. Add Unit Tests (8 hours)

Target Structure:

tests/
├── __init__.py
├── conftest.py                    # Pytest fixtures (sketch below)
├── test_model/
│   ├── __init__.py
│   ├── test_attention.py
│   ├── test_blocks.py
│   └── test_language_model.py
├── test_data/
│   ├── __init__.py
│   ├── test_tokenizer.py
│   ├── test_dataset.py
│   └── test_dataloader.py
└── test_training/
    ├── __init__.py
    └── test_loss.py
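
A minimal conftest.py, sketching fixtures the suites could share (fixture names and sizes are illustrative, not taken from the codebase):

"""Shared pytest fixtures (illustrative sketch)."""

import pytest
import torch


@pytest.fixture(scope='session')
def device():
    """Run on CUDA when available so GPU code paths are exercised too."""
    return 'cuda' if torch.cuda.is_available() else 'cpu'


@pytest.fixture
def sample_batch():
    """Small deterministic (input, target) token batch."""
    torch.manual_seed(0)
    x = torch.randint(0, 1000, (2, 16))  # (batch_size, seq_len) token ids
    y = torch.randint(0, 1000, (2, 16))
    return x, y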

Example: test_model/test_attention.py

"""Tests for CausalSelfAttention."""

import pytest
import torch
from model.attention import CausalSelfAttention


class TestCausalSelfAttention:
    """Test suite for CausalSelfAttention."""

    @pytest.fixture
    def model(self):
        """Create a test model."""
        return CausalSelfAttention(d_model=64, n_heads=4)

    def test_forward_shape(self, model):
        """Test forward pass output shape."""
        batch_size = 2
        seq_len = 10
        d_model = 64

        x = torch.randn(batch_size, seq_len, d_model)
        output = model(x)

        assert output.shape == (batch_size, seq_len, d_model)

    def test_causal_masking(self, model):
        """Test that causal masking prevents attending to future positions."""
        seq_len = 5
        d_model = 64

        model.eval()  # disable dropout so outputs are deterministic
        torch.manual_seed(0)
        x = torch.randn(1, seq_len, d_model)
        x_perturbed = x.clone()
        x_perturbed[0, -1, :] += 1.0  # change only the last position

        with torch.no_grad():
            out = model(x)
            out_perturbed = model(x_perturbed)

        # With causal masking, perturbing the last token must leave
        # all earlier positions' outputs unchanged
        assert torch.allclose(out[0, :-1], out_perturbed[0, :-1], atol=1e-6)

    def test_divisible_d_model(self):
        """Test that d_model must be divisible by n_heads."""
        with pytest.raises(AssertionError):
            CausalSelfAttention(d_model=65, n_heads=4)
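
For test_training/test_loss.py from the tree above, one useful end-to-end check is that a few optimizer steps on a fixed batch reduce the loss. A sketch; the LanguageModel constructor arguments are hypothetical and should match the real signature:

"""Sanity check for the training objective (illustrative sketch)."""

import torch
from model.language_model import LanguageModel


def test_loss_decreases_on_fixed_batch():
    """A few AdamW steps on one batch should reduce cross-entropy."""
    torch.manual_seed(0)
    # Hypothetical constructor arguments -- match the real LanguageModel signature.
    model = LanguageModel(vocab_size=100, d_model=32, n_heads=4, n_layers=2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    x = torch.randint(0, 100, (4, 16))
    y = torch.randint(0, 100, (4, 16))

    def compute_loss():
        logits, _ = model(x)  # model returns (logits, ...) as in train.py
        return torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1)
        )

    initial = compute_loss().item()
    for _ in range(5):
        optimizer.zero_grad()
        loss = compute_loss()
        loss.backward()
        optimizer.step()

    assert compute_loss().item() < initial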

Success Criteria:

- 80% code coverage
- All public APIs tested
- Tests pass consistently


Phase 4: Optimization (2 hours)

5. Mixed Precision Training (2 hours)

Changes to train.py:

def train_one_epoch(model, train_loader, optimizer, device='cuda', epoch=0, scaler=None):
    """Train for one epoch (with optional mixed precision)."""
    model.train()
    total_loss = 0
    total_tokens = 0

    pbar = tqdm(train_loader, desc=f'Epoch {epoch}')

    for batch_idx, (x, y) in enumerate(pbar):
        x = x.to(device)
        y = y.to(device)

        optimizer.zero_grad()  # clear gradients on both paths

        # Mixed precision path
        if scaler is not None and device == 'cuda':
            with torch.cuda.amp.autocast():
                logits, _ = model(x)
                loss = torch.nn.functional.cross_entropy(
                    logits.view(-1, logits.size(-1)),
                    y.view(-1),
                    reduction='mean'
                )
            scaler.scale(loss).backward()
            if GRADIENT_CLIP > 0:
                scaler.unscale_(optimizer)  # unscale before clipping raw gradients
                torch.nn.utils.clip_grad_norm_(model.parameters(), GRADIENT_CLIP)
            scaler.step(optimizer)
            scaler.update()
        else:
            # Full precision path
            logits, _ = model(x)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)),
                y.view(-1),
                reduction='mean'
            )
            loss.backward()
            if GRADIENT_CLIP > 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), GRADIENT_CLIP)
            optimizer.step()

        total_loss += loss.item()
        total_tokens += y.numel()

        pbar.set_postfix({'loss': f'{loss.item():.4f}'})

    avg_loss = total_loss / len(train_loader)
    return avg_loss

def main():
    """Main training entry point."""
    # ... setup code ...

    # Create the gradient scaler (CUDA only)
    from torch.cuda.amp import GradScaler
    scaler = GradScaler() if device == 'cuda' else None

    # Training loop
    while time.time() - start_time < TRAIN_TIME_BUDGET:
        epoch += 1

        train_loss = train_one_epoch(
            model, train_loader, optimizer, device, epoch, scaler
        )

        # ... rest of code ...

Expected Impact:

- Up to ~2x training throughput (on Tensor Core GPUs such as V100/A100)
- Lower GPU memory usage
- Comparable accuracy
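
If fp16 triggers the numerical instability flagged under Risk Assessment, bfloat16 autocast is a drop-in alternative that needs no GradScaler; note it requires hardware bf16 support (e.g., A100, but not V100). A sketch of the inner training step:

# Optional bf16 variant (assumes bf16-capable hardware); no GradScaler needed.
optimizer.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    logits, _ = model(x)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), y.view(-1)
    )
loss.backward()  # plain backward; bf16 has fp32-like dynamic range
optimizer.step()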


Success Metrics

Before Improvements

| Metric | Value |
|--------|-------|
| Overall Score | 3.1/5.0 |
| Docstring Coverage | ~10% |
| Exception Handling | Minimal |
| Max File Length (lines) | 377 |
| Test Coverage | 0% |

After Improvements

| Metric | Target |
|--------|--------|
| Overall Score | 4.5/5.0 |
| Docstring Coverage | 100% |
| Exception Handling | Comprehensive |
| Max File Length (lines) | <200 |
| Test Coverage | 80% |

Timeline

| Phase | Duration | Target Date |
|-------|----------|-------------|
| Phase 1: Quick Wins | 6 hours | Day 1 |
| Phase 2: Refactoring | 4 hours | Day 2 |
| Phase 3: Testing | 8 hours | Days 3-4 |
| Phase 4: Optimization | 2 hours | Day 4 |
| Total | 20 hours | 4 days |

Risk Assessment

Low Risk

  • ✅ Splitting files (backward compatible with imports)
  • ✅ Adding docstrings (no functional change)

Medium Risk

  • ⚠️ Adding exception handling (may introduce new code paths)
  • ⚠️ Mixed precision (may affect numerical stability)

Mitigation Strategies

  1. Run tests after each phase
  2. Maintain backward compatibility
  3. Compare results before/after changes
  4. Gradual rollout with monitoring

Acceptance Criteria

Phase 1 complete when:

- [ ] All file operations have try-except blocks
- [ ] User-friendly error messages
- [ ] All classes and methods have Google-style docstrings

Phase 2 complete when:

- [ ] All files < 200 lines
- [ ] Clear module structure
- [ ] Easy to test individual components

Phase 3 complete when:

- [ ] 80% code coverage
- [ ] All tests pass
- [ ] CI/CD runs tests automatically

Phase 4 complete when:

- [ ] Mixed precision implemented
- [ ] 2x speed improvement confirmed
- [ ] Same accuracy maintained


Conclusion

This improvement plan will elevate zhinengresearch from "Good" (3.1/5.0) to "Excellent" (4.5/5.0) in just 20 hours of focused work.

Key Benefits:

- Better maintainability (docstrings, smaller files)
- Robust error handling
- High confidence in correctness (tests)
- Faster training (mixed precision)

Next Steps:

1. Start with Phase 1 (exception handling and docstrings)
2. Get team feedback
3. Proceed to Phases 2-4 based on priorities


Plan Date: 2026-03-23
Author: LingFlow Code Review Framework
Version: v3.3.0