Building an African Cultural Dataset with SmolAgents: Experimental

Community Article Published February 7, 2025


Introduction

SmolAgents provides a powerful framework for creating rich cultural datasets through a multi-agent system. This implementation focuses on collecting, organizing, and reasoning about African cultural knowledge using specialized AI agents.

System Architecture

First, install the required packages (the extras spec is quoted so shells like zsh don't expand the brackets):

!pip install "smolagents[litellm]" datasets

Model Configuration

self.model = LiteLLMModel(model_id="gpt-4o-mini")
self.reasoning_model = LiteLLMModel(model_id="o3-mini", reasoning_effort="high")
self.coder_model = LiteLLMModel(
    model_id="openrouter/anthropic/claude-3.5-sonnet",
    temperature=0.8
)
self.robust_model = LiteLLMModel(model_id="o1")
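These model IDs are routed through LiteLLM, so each provider's API key has to be in the environment before the models are constructed. A minimal sketch (the key values are placeholders, not real keys):

```python
import os

# LiteLLM reads provider credentials from the environment:
# gpt-4o-mini, o3-mini, and o1 use OPENAI_API_KEY;
# the openrouter/anthropic/... model uses OPENROUTER_API_KEY.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")
os.environ.setdefault("OPENROUTER_API_KEY", "sk-or-placeholder")
```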

Specialized Agents

Research Agent

  • Equipped with web search and webpage visit capabilities
  • Uses high-capability model for complex reasoning
  • Maximum 6 processing steps for thorough research
  • Access to extensive data processing tools

self.researcher = CodeAgent(
    tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
    model=self.coder_model,
    max_steps=6,
    verbosity_level=3,
    additional_authorized_imports=[
        'math', 'queue', 'stat', 'statistics', 're', 'itertools',
        'unicodedata', 'collections', 'datetime', 'time', 'random',
        'bs4', 'markdownify', 'requests', 'pandas'
    ]
)

async def research_cultural_info(self, category: str, topic: str) -> Dict:
    try:
        research_prompt = f"""
        You are an expert researcher on African history.
        Research and provide comprehensive information about {topic} in African {category}.
        Focus on historical context, regional variations, and modern practices.
        """
        research_data = self.researcher.run(research_prompt)

        structure_prompt = f"""
        Based on this research: {research_data}
        Create a structured JSON with:
        {{
            "overview": "brief description",
            "historical_context": "historical background",
            "regional_variations": ["list of variations by region"],
            "cultural_significance": "detailed significance",
            "modern_practices": "current adaptations",
            "sources": ["list of sources"]
        }}
        """
        structured_data = await self.generate_with_model(structure_prompt)
        return json.loads(structured_data)
    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        return {}
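A recurring failure mode here is that `json.loads` receives model output wrapped in markdown fences or surrounded by prose. A small best-effort extractor (a sketch, not part of the original code) makes the parse step more forgiving:

```python
import json
import re

def extract_json(text: str):
    """Best-effort JSON extraction: strip markdown fences the model may add."""
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return {}
```

Calling `extract_json(structured_data)` instead of `json.loads(structured_data)` would turn a fenced response into a parsed dict rather than an exception.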

QA Generator Agent

  • Creates culturally-aware questions and answers
  • Implements difficulty levels (basic/intermediate/advanced)
  • Ensures regional representation
  • Maintains cultural authenticity

async def generate_qa_pairs(self, cultural_data: Dict) -> List[Dict]:
    try:
        qa_prompt = f"""
        Based on this cultural information:
        {json.dumps(cultural_data, indent=2)}

        Generate 6 question-answer pairs in this JSON format:
        [{{
            "question": "detailed question",
            "answer": "comprehensive answer",
            "difficulty": "basic|intermediate|advanced",
            "category": "historical|practical|conceptual",
            "regions": ["relevant African regions"]
        }}]
        """
        qa_response = await self.generate_with_model(qa_prompt)
        return json.loads(qa_response)
    except Exception as e:
        print(f"QA generation error: {e}")
        return []

Reasoning Generator Agent

  • Produces detailed solution chains
  • Breaks down cultural concepts
  • Provides step-by-step analysis
  • Links historical and modern contexts

async def generate_reasoning(self, qa_pairs: List[Dict]) -> List[Dict]:
    try:
        reasoning_prompt = f"""
        For these Q&A pairs:
        {json.dumps(qa_pairs, indent=2)}

        Generate detailed reasoning chains in this JSON format:
        [{{
            "question": "original question",
            "reasoning_steps": [
                "step 1: initial understanding",
                "step 2: cultural context",
                "step 3: analysis",
                "step 4: conclusion"
            ],
            "final_answer": "detailed answer",
            "cultural_context": "relevant cultural background",
            "sources": ["reference sources"]
        }}]
        """
        reasoning_data = await self.generate_with_model(reasoning_prompt)
        return json.loads(reasoning_data)
    except Exception as e:
        print(f"Reasoning generation error: {e}")
        return []

Data Collection Pipeline

Cultural Research Phase

{
    "overview": "brief description",
    "historical_context": "historical background",
    "regional_variations": ["variations by region"],
    "cultural_significance": "detailed significance",
    "modern_practices": "current adaptations",
    "sources": ["reference sources"]
}
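Since this structure is only requested in the prompt, not guaranteed, a lightweight completeness check before storing a record can catch truncated or malformed responses. A sketch against the schema above:

```python
RESEARCH_KEYS = {
    "overview", "historical_context", "regional_variations",
    "cultural_significance", "modern_practices", "sources",
}

def is_complete_research(record: dict) -> bool:
    """True if every expected field is present and non-empty."""
    return RESEARCH_KEYS <= record.keys() and all(record[k] for k in RESEARCH_KEYS)
```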

QA Generation Phase

{
    "question": "detailed question",
    "answer": "comprehensive answer",
    "difficulty": "basic|intermediate|advanced",
    "category": "historical|practical|conceptual",
    "regions": ["relevant African regions"]
}
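The difficulty and category fields are constrained to fixed vocabularies, so generated pairs can be filtered before they enter the dataset. A validation sketch (not part of the original pipeline):

```python
ALLOWED_DIFFICULTY = {"basic", "intermediate", "advanced"}
ALLOWED_CATEGORY = {"historical", "practical", "conceptual"}
REQUIRED_KEYS = {"question", "answer", "difficulty", "category", "regions"}

def validate_qa_pairs(pairs: list) -> list:
    """Drop any pair that does not match the schema the prompt asks for."""
    return [
        p for p in pairs
        if isinstance(p, dict)
        and REQUIRED_KEYS <= p.keys()
        and p["difficulty"] in ALLOWED_DIFFICULTY
        and p["category"] in ALLOWED_CATEGORY
        and isinstance(p["regions"], list)
    ]
```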

Reasoning Chain Generation

{
    "question": "original question",
    "reasoning_steps": [
        "step 1: initial understanding",
        "step 2: cultural context",
        "step 3: analysis",
        "step 4: conclusion"
    ],
    "final_answer": "detailed answer",
    "cultural_context": "relevant background"
}
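A matching shape check for reasoning records can gate out responses whose chains collapsed into a single step or lost their answer. A minimal sketch:

```python
def valid_reasoning(record: dict) -> bool:
    """Minimal shape check for one reasoning record."""
    steps = record.get("reasoning_steps")
    return (
        isinstance(steps, list)
        and len(steps) >= 2
        and bool(record.get("question"))
        and bool(record.get("final_answer"))
    )
```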

Full Code

import os
from typing import Dict, List, Any
import json
from datetime import datetime
import asyncio
from smolagents import CodeAgent, DuckDuckGoSearchTool, VisitWebpageTool, LiteLLMModel

class AfricanCultureDataGenerator:
    def __init__(self, api_key: str):
        # Initialize with explicit API key
        os.environ["OPENAI_API_KEY"] = api_key
        
        self.model = LiteLLMModel(
            model_id="gpt-4o-mini",
        )
        self.reasoning_model = LiteLLMModel(
            model_id="o3-mini",
            reasoning_effort="high",
        )

        self.coder_model = LiteLLMModel(
            model_id="openrouter/anthropic/claude-3.5-sonnet",
            api_key=os.environ["OPENROUTER_API_KEY"],
            temperature=0.8
        )

        self.robust_model = LiteLLMModel(
            model_id="o1",
        )
        
        # Research Agent
        self.researcher = CodeAgent(
            tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
            model=self.coder_model,
            max_steps=6,
            verbosity_level=3,
            additional_authorized_imports=[
                'math', 'queue', 'stat', 'statistics', 're', 'itertools',
                'unicodedata', 'collections', 'datetime', 'time', 'random',
                'bs4', 'markdownify', 'requests', 'pandas'
            ]
        )
        
        self.categories = {
            "traditions": [
                "marriage ceremonies",
                "naming ceremonies",
                "initiation rituals"
                "storytelling",
                "science"
            ],
            "music": [
                "traditional instruments",
                "musical styles",
                "dance forms",
                "ceremonial music"
            ],
            "social_structures": [
                "family systems",
                "leadership roles",
                "age groups",
               "community organization"
            ],
            "cultural_values": [
                "respect for elders",
                "community solidarity",
                "spiritual beliefs",
                "oral traditions"
            ]
        }
        
    async def generate(self, prompt: str) -> str:
      agent = CodeAgent(
          tools=[], 
          model=self.model,
          max_steps=6,
          additional_authorized_imports=['bs4', 'stat', 'statistics', 'unicodedata', 'collections', 'requests', 'time', 'json', 'os', 'random', 'math', 'queue', 'markdownify', 're', 'itertools', 'datetime', 'pandas']
      )
      # Get the agent's response.
      response = agent.run(prompt)
      # If the response is a dictionary, convert it to a JSON string.
      if isinstance(response, dict):
          return json.dumps(response)
      # Otherwise, return the response as is.
      return response
    
    async def generate_with_model(self, prompt: str) -> str:
        try:
            response = await self.generate(prompt)
            return response if response else "{}"
        except Exception as e:
            print(f"Model generation error: {e}")
            return "{}"

    async def research_cultural_info(self, category: str, topic: str) -> Dict:
        try:
            research_prompt = f"""
            You are an expert researcher on African history.
            Research and provide comprehensive information about {topic} in African {category}.
            Focus on historical context, regional variations, and modern practices.
            """
            research_data = self.researcher.run(research_prompt)
            
            structure_prompt = f"""
            Based on this research: {research_data}
            Create a structured JSON with:
            {{
                "overview": "brief description",
                "historical_context": "historical background",
                "regional_variations": ["list of variations by region"],
                "cultural_significance": "detailed significance",
                "modern_practices": "current adaptations",
                "sources": ["list of sources"]
            }}
            """
            structured_data = await self.generate_with_model(structure_prompt)
            return json.loads(structured_data)
        except json.JSONDecodeError as e:
            print(f"JSON parsing error: {e}")
            return {}

    async def generate_qa_pairs(self, cultural_data: Dict) -> List[Dict]:
        try:
            qa_prompt = f"""
            Based on this cultural information:
            {json.dumps(cultural_data, indent=2)}
            
            Generate 6 question-answer pairs in this JSON format:
            [{{
                "question": "detailed question",
                "answer": "comprehensive answer",
                "difficulty": "basic|intermediate|advanced",
                "category": "historical|practical|conceptual",
                "regions": ["relevant African regions"]
            }}]
            """
            qa_response = await self.generate_with_model(qa_prompt)
            return json.loads(qa_response)
        except Exception as e:
            print(f"QA generation error: {e}")
            return []

    async def generate_reasoning(self, qa_pairs: List[Dict]) -> List[Dict]:
        try:
            reasoning_prompt = f"""
            For these Q&A pairs:
            {json.dumps(qa_pairs, indent=2)}
            
            Generate detailed reasoning chains in this JSON format:
            [{{
                "question": "original question",
                "reasoning_steps": [
                    "step 1: initial understanding",
                    "step 2: cultural context",
                    "step 3: analysis",
                    "step 4: conclusion"
                ],
                "final_answer": "detailed answer",
                "cultural_context": "relevant cultural background",
                "sources": ["reference sources"]
            }}]
            """
            reasoning_data = await self.generate_with_model(reasoning_prompt)
            return json.loads(reasoning_data)
        except Exception as e:
            print(f"Reasoning generation error: {e}")
            return []

    async def process_category(self, category: str, topic: str) -> Dict:
        try:
            cultural_data = await self.research_cultural_info(category, topic)
            qa_pairs = await self.generate_qa_pairs(cultural_data)
            reasoning_data = await self.generate_reasoning(qa_pairs)
            
            return {
                "category": category,
                "topic": topic,
                "cultural_data": cultural_data,
                "qa_pairs": qa_pairs,
                "reasoning_data": reasoning_data,
                "metadata": {
                    "generated_at": datetime.now().isoformat(),
                    "model": "gpt-family/o3",
                    "version": "1.0"
                }
            }
        except Exception as e:
            print(f"Error processing {category}/{topic}: {e}")
            return {"error": str(e)}

    async def generate_dataset(self):
        dataset = {}
        for category, topics in self.categories.items():
            dataset[category] = {}
            for topic in topics:
                print(f"Processing {category}/{topic}...")
                dataset[category][topic] = await self.process_category(category, topic)
                await asyncio.sleep(2)
        
        with open("african_cultural_dataset.json", "w", encoding="utf-8") as f:
            json.dump(dataset, f, indent=2, ensure_ascii=False)
        
        return dataset

async def main():
    api_key = os.environ["OPENAI_API_KEY"]
    generator = AfricanCultureDataGenerator(api_key)
    dataset = await generator.generate_dataset()
    print("Dataset generation complete!")

if __name__ == "__main__":
    # `await main()` is only valid at the top level of a notebook;
    # in a script, start the event loop explicitly.
    asyncio.run(main())
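The output file nests records by category and topic. Since the install step already pulls in `datasets`, the JSON can be flattened into one row per QA pair and loaded as a Hugging Face dataset; the repo id below is a placeholder. A sketch:

```python
import json

def flatten_dataset(path: str = "african_cultural_dataset.json") -> list:
    """Flatten the nested category -> topic structure into one row per QA pair."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    rows = []
    for category, topics in data.items():
        for topic, entry in topics.items():
            for qa in entry.get("qa_pairs", []):
                rows.append({"category": category, "topic": topic, **qa})
    return rows

# Once flattened, the rows load directly into a Hugging Face dataset
# (the repo id is a placeholder):
# from datasets import Dataset
# Dataset.from_list(flatten_dataset()).push_to_hub("your-username/african-culture-qa")
```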

Conclusion

This implementation demonstrates the power of specialized AI agents in creating rich African cultural datasets, leveraging a multi-agent architecture for research, QA generation, and reasoning chains. While the current implementation shows promise, transitioning to an E2B code executor with an Orchestrator would offer several advantages:

  1. Better execution control and resource management
  2. Improved error handling and API key management
  3. Parallel processing of cultural data collection
  4. Scalable infrastructure for larger datasets
  5. Enhanced monitoring and validation capabilities

The next phase should focus on:

  • Implementing an Orchestrator to manage agent workflows
  • Utilizing E2B's code execution environment for reliable processing
  • Adding robust validation mechanisms for cultural accuracy
  • Implementing parallel data collection across regions
  • Enhancing the reasoning chain generation with distributed processing

This evolution would maintain the current system's cultural authenticity while adding enterprise-grade reliability and scalability.
