Building an African Cultural Dataset with SmolAgents: Experimental

Community Article Published February 7, 2025


Introduction

SmolAgents provides a powerful framework for creating rich cultural datasets through a multi-agent system. This implementation focuses on collecting, organizing, and reasoning about African cultural knowledge using specialized AI agents.

System Architecture

First, install the required packages (the extras spec is quoted so shells like zsh don't expand the brackets):

!pip install "smolagents[litellm]" datasets

Model Configuration

self.model = LiteLLMModel(model_id="gpt-4o-mini")
self.reasoning_model = LiteLLMModel(model_id="o3-mini", reasoning_effort="high")
self.coder_model = LiteLLMModel(
    model_id="openrouter/anthropic/claude-3.5-sonnet",
    temperature=0.8
)
self.robust_model = LiteLLMModel(model_id="o1")
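These model IDs are routed through LiteLLM, so each provider's API key has to be in the environment before the models are constructed. A minimal sketch (the key values are placeholders, not real keys):

```python
import os

# LiteLLM reads provider credentials from the environment:
# gpt-4o-mini, o3-mini, and o1 use OPENAI_API_KEY;
# the openrouter/anthropic/... model uses OPENROUTER_API_KEY.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")
os.environ.setdefault("OPENROUTER_API_KEY", "sk-or-placeholder")
```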

Specialized Agents

Research Agent

  • Equipped with web search and webpage visit capabilities
  • Uses high-capability model for complex reasoning
  • Maximum 6 processing steps for thorough research
  • Access to extensive data processing tools

self.researcher = CodeAgent(
    tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
    model=self.coder_model,
    max_steps=6,
    verbosity_level=3,
    additional_authorized_imports=[
        'math', 'queue', 'stat', 'statistics', 're', 'itertools',
        'unicodedata', 'collections', 'datetime', 'time', 'random',
        'bs4', 'markdownify', 'requests', 'pandas'
    ]
)

async def research_cultural_info(self, category: str, topic: str) -> Dict:
    try:
        research_prompt = f"""
        You are an expert researcher on African history.
        Research and provide comprehensive information about {topic} in African {category}.
        Focus on historical context, regional variations, and modern practices.
        """
        research_data = self.researcher.run(research_prompt)

        structure_prompt = f"""
        Based on this research: {research_data}
        Create a structured JSON with:
        {{
            "overview": "brief description",
            "historical_context": "historical background",
            "regional_variations": ["list of variations by region"],
            "cultural_significance": "detailed significance",
            "modern_practices": "current adaptations",
            "sources": ["list of sources"]
        }}
        """
        structured_data = await self.generate_with_model(structure_prompt)
        return json.loads(structured_data)
    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        return {}
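A recurring failure mode here is that `json.loads` receives model output wrapped in markdown fences or surrounded by prose. A small best-effort extractor (a sketch, not part of the original code) makes the parse step more forgiving:

```python
import json
import re

def extract_json(text: str):
    """Best-effort JSON extraction: strip markdown fences the model may add."""
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return {}
```

Calling `extract_json(structured_data)` instead of `json.loads(structured_data)` would turn a fenced response into a parsed dict rather than an exception.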

QA Generator Agent

  • Creates culturally-aware questions and answers
  • Implements difficulty levels (basic/intermediate/advanced)
  • Ensures regional representation
  • Maintains cultural authenticity

async def generate_qa_pairs(self, cultural_data: Dict) -> List[Dict]:
    try:
        qa_prompt = f"""
        Based on this cultural information:
        {json.dumps(cultural_data, indent=2)}

        Generate 6 question-answer pairs in this JSON format:
        [{{
            "question": "detailed question",
            "answer": "comprehensive answer",
            "difficulty": "basic|intermediate|advanced",
            "category": "historical|practical|conceptual",
            "regions": ["relevant African regions"]
        }}]
        """
        qa_response = await self.generate_with_model(qa_prompt)
        return json.loads(qa_response)
    except Exception as e:
        print(f"QA generation error: {e}")
        return []

Reasoning Generator Agent

  • Produces detailed solution chains
  • Breaks down cultural concepts
  • Provides step-by-step analysis
  • Links historical and modern contexts

async def generate_reasoning(self, qa_pairs: List[Dict]) -> List[Dict]:
    try:
        reasoning_prompt = f"""
        For these Q&A pairs:
        {json.dumps(qa_pairs, indent=2)}

        Generate detailed reasoning chains in this JSON format:
        [{{
            "question": "original question",
            "reasoning_steps": [
                "step 1: initial understanding",
                "step 2: cultural context",
                "step 3: analysis",
                "step 4: conclusion"
            ],
            "final_answer": "detailed answer",
            "cultural_context": "relevant cultural background",
            "sources": ["reference sources"]
        }}]
        """
        reasoning_data = await self.generate_with_model(reasoning_prompt)
        return json.loads(reasoning_data)
    except Exception as e:
        print(f"Reasoning generation error: {e}")
        return []

Data Collection Pipeline

Cultural Research Phase

{
    "overview": "brief description",
    "historical_context": "historical background",
    "regional_variations": ["variations by region"],
    "cultural_significance": "detailed significance",
    "modern_practices": "current adaptations",
    "sources": ["reference sources"]
}
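Since this structure is only requested in the prompt, not guaranteed, a lightweight completeness check before storing a record can catch truncated or malformed responses. A sketch against the schema above:

```python
RESEARCH_KEYS = {
    "overview", "historical_context", "regional_variations",
    "cultural_significance", "modern_practices", "sources",
}

def is_complete_research(record: dict) -> bool:
    """True if every expected field is present and non-empty."""
    return RESEARCH_KEYS <= record.keys() and all(record[k] for k in RESEARCH_KEYS)
```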

QA Generation Phase

{
    "question": "detailed question",
    "answer": "comprehensive answer",
    "difficulty": "basic|intermediate|advanced",
    "category": "historical|practical|conceptual",
    "regions": ["relevant African regions"]
}
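The difficulty and category fields are constrained to fixed vocabularies, so generated pairs can be filtered before they enter the dataset. A validation sketch (not part of the original pipeline):

```python
ALLOWED_DIFFICULTY = {"basic", "intermediate", "advanced"}
ALLOWED_CATEGORY = {"historical", "practical", "conceptual"}
REQUIRED_KEYS = {"question", "answer", "difficulty", "category", "regions"}

def validate_qa_pairs(pairs: list) -> list:
    """Drop any pair that does not match the schema the prompt asks for."""
    return [
        p for p in pairs
        if isinstance(p, dict)
        and REQUIRED_KEYS <= p.keys()
        and p["difficulty"] in ALLOWED_DIFFICULTY
        and p["category"] in ALLOWED_CATEGORY
        and isinstance(p["regions"], list)
    ]
```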

Reasoning Chain Generation

{
    "question": "original question",
    "reasoning_steps": [
        "step 1: initial understanding",
        "step 2: cultural context",
        "step 3: analysis",
        "step 4: conclusion"
    ],
    "final_answer": "detailed answer",
    "cultural_context": "relevant background"
}
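A matching shape check for reasoning records can gate out responses whose chains collapsed into a single step or lost their answer. A minimal sketch:

```python
def valid_reasoning(record: dict) -> bool:
    """Minimal shape check for one reasoning record."""
    steps = record.get("reasoning_steps")
    return (
        isinstance(steps, list)
        and len(steps) >= 2
        and bool(record.get("question"))
        and bool(record.get("final_answer"))
    )
```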

Full Code

import os
from typing import Dict, List, Any
import json
from datetime import datetime
import asyncio
from smolagents import CodeAgent, DuckDuckGoSearchTool, VisitWebpageTool, LiteLLMModel

class AfricanCultureDataGenerator:
    def __init__(self, api_key: str):
        # Initialize with explicit API key
        os.environ["OPENAI_API_KEY"] = api_key
        
        self.model = LiteLLMModel(
            model_id="gpt-4o-mini",
        )
        self.reasoning_model = LiteLLMModel(
            model_id="o3-mini",
            reasoning_effort="high",
        )

        self.coder_model = LiteLLMModel(
            model_id="openrouter/anthropic/claude-3.5-sonnet",
            api_key=os.environ["OPENROUTER_API_KEY"],
            temperature=0.8
        )

        self.robust_model = LiteLLMModel(
            model_id="o1",
        )
        
        # Research Agent
        self.researcher = CodeAgent(
            tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
            model=self.coder_model,
            max_steps=6,
            verbosity_level=3,
            additional_authorized_imports=[
                'math', 'queue', 'stat', 'statistics', 're', 'itertools',
                'unicodedata', 'collections', 'datetime', 'time', 'random',
                'bs4', 'markdownify', 'requests', 'pandas'
            ]
        )
        
        self.categories = {
            "traditions": [
                "marriage ceremonies",
                "naming ceremonies",
                "initiation rituals"
                "storytelling",
                "science"
            ],
            "music": [
                "traditional instruments",
                "musical styles",
                "dance forms",
                "ceremonial music"
            ],
            "social_structures": [
                "family systems",
                "leadership roles",
                "age groups",
               "community organization"
            ],
            "cultural_values": [
                "respect for elders",
                "community solidarity",
                "spiritual beliefs",
                "oral traditions"
            ]
        }
        
    async def generate(self, prompt: str) -> str:
      agent = CodeAgent(
          tools=[], 
          model=self.model,
          max_steps=6,
          additional_authorized_imports=['bs4', 'stat', 'statistics', 'unicodedata', 'collections', 'requests', 'time', 'json', 'os', 'random', 'math', 'queue', 'markdownify', 're', 'itertools', 'datetime', 'pandas']
      )
      # Get the agent's response.
      response = agent.run(prompt)
      # If the response is a dictionary, convert it to a JSON string.
      if isinstance(response, dict):
          return json.dumps(response)
      # Otherwise, return the response as is.
      return response
    
    async def generate_with_model(self, prompt: str) -> str:
        try:
            response = await self.generate(prompt)
            return response if response else "{}"
        except Exception as e:
            print(f"Model generation error: {e}")
            return "{}"

    async def research_cultural_info(self, category: str, topic: str) -> Dict:
        try:
            research_prompt = f"""
            You are an expert researcher on African history.
            Research and provide comprehensive information about {topic} in African {category}.
            Focus on historical context, regional variations, and modern practices.
            """
            research_data = self.researcher.run(research_prompt)
            
            structure_prompt = f"""
            Based on this research: {research_data}
            Create a structured JSON with:
            {{
                "overview": "brief description",
                "historical_context": "historical background",
                "regional_variations": ["list of variations by region"],
                "cultural_significance": "detailed significance",
                "modern_practices": "current adaptations",
                "sources": ["list of sources"]
            }}
            """
            structured_data = await self.generate_with_model(structure_prompt)
            return json.loads(structured_data)
        except json.JSONDecodeError as e:
            print(f"JSON parsing error: {e}")
            return {}

    async def generate_qa_pairs(self, cultural_data: Dict) -> List[Dict]:
        try:
            qa_prompt = f"""
            Based on this cultural information:
            {json.dumps(cultural_data, indent=2)}
            
            Generate 6 question-answer pairs in this JSON format:
            [{{
                "question": "detailed question",
                "answer": "comprehensive answer",
                "difficulty": "basic|intermediate|advanced",
                "category": "historical|practical|conceptual",
                "regions": ["relevant African regions"]
            }}]
            """
            qa_response = await self.generate_with_model(qa_prompt)
            return json.loads(qa_response)
        except Exception as e:
            print(f"QA generation error: {e}")
            return []

    async def generate_reasoning(self, qa_pairs: List[Dict]) -> List[Dict]:
        try:
            reasoning_prompt = f"""
            For these Q&A pairs:
            {json.dumps(qa_pairs, indent=2)}
            
            Generate detailed reasoning chains in this JSON format:
            [{{
                "question": "original question",
                "reasoning_steps": [
                    "step 1: initial understanding",
                    "step 2: cultural context",
                    "step 3: analysis",
                    "step 4: conclusion"
                ],
                "final_answer": "detailed answer",
                "cultural_context": "relevant cultural background",
                "sources": ["reference sources"]
            }}]
            """
            reasoning_data = await self.generate_with_model(reasoning_prompt)
            return json.loads(reasoning_data)
        except Exception as e:
            print(f"Reasoning generation error: {e}")
            return []

    async def process_category(self, category: str, topic: str) -> Dict:
        try:
            cultural_data = await self.research_cultural_info(category, topic)
            qa_pairs = await self.generate_qa_pairs(cultural_data)
            reasoning_data = await self.generate_reasoning(qa_pairs)
            
            return {
                "category": category,
                "topic": topic,
                "cultural_data": cultural_data,
                "qa_pairs": qa_pairs,
                "reasoning_data": reasoning_data,
                "metadata": {
                    "generated_at": datetime.now().isoformat(),
                    "model": "gpt-family/o3",
                    "version": "1.0"
                }
            }
        except Exception as e:
            print(f"Error processing {category}/{topic}: {e}")
            return {"error": str(e)}

    async def generate_dataset(self):
        dataset = {}
        for category, topics in self.categories.items():
            dataset[category] = {}
            for topic in topics:
                print(f"Processing {category}/{topic}...")
                dataset[category][topic] = await self.process_category(category, topic)
                await asyncio.sleep(2)
        
        with open("african_cultural_dataset.json", "w", encoding="utf-8") as f:
            json.dump(dataset, f, indent=2, ensure_ascii=False)
        
        return dataset

async def main():
    api_key = os.environ["OPENAI_API_KEY"]
    generator = AfricanCultureDataGenerator(api_key)
    dataset = await generator.generate_dataset()
    print("Dataset generation complete!")

if __name__ == "__main__":
    # `await main()` is only valid at the top level of a notebook;
    # in a script, start the event loop explicitly.
    asyncio.run(main())
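The output file nests records by category and topic. Since the install step already pulls in `datasets`, the JSON can be flattened into one row per QA pair and loaded as a Hugging Face dataset; the repo id below is a placeholder. A sketch:

```python
import json

def flatten_dataset(path: str = "african_cultural_dataset.json") -> list:
    """Flatten the nested category -> topic structure into one row per QA pair."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    rows = []
    for category, topics in data.items():
        for topic, entry in topics.items():
            for qa in entry.get("qa_pairs", []):
                rows.append({"category": category, "topic": topic, **qa})
    return rows

# Once flattened, the rows load directly into a Hugging Face dataset
# (the repo id is a placeholder):
# from datasets import Dataset
# Dataset.from_list(flatten_dataset()).push_to_hub("your-username/african-culture-qa")
```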

Conclusion

This implementation demonstrates the power of specialized AI agents in creating rich African cultural datasets, leveraging a multi-agent architecture for research, QA generation, and reasoning chains. While the current implementation shows promise, transitioning to an E2B code executor with an Orchestrator would offer several advantages:

  1. Better execution control and resource management
  2. Improved error handling and API key management
  3. Parallel processing of cultural data collection
  4. Scalable infrastructure for larger datasets
  5. Enhanced monitoring and validation capabilities

The next phase should focus on:

  • Implementing an Orchestrator to manage agent workflows
  • Utilizing E2B's code execution environment for reliable processing
  • Adding robust validation mechanisms for cultural accuracy
  • Implementing parallel data collection across regions
  • Enhancing the reasoning chain generation with distributed processing

This evolution would maintain the current system's cultural authenticity while adding enterprise-grade reliability and scalability.
