# Building an African Cultural Dataset with SmoLAgents: Experimental
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6168218a4ed0b975c18f82a8/A05SgyoW9R0_aXtayiiC4.jpeg)
## Introduction
SmoLAgents provides a powerful framework for creating rich cultural datasets through a multi-agent system. This implementation focuses on collecting, organizing, and reasoning about African cultural knowledge using specialized AI agents.
## System Architecture

```bash
pip install "smolagents[litellm]" datasets
```
### Model Configuration

```python
self.model = LiteLLMModel(model_id="gpt-4o-mini")
self.reasoning_model = LiteLLMModel(model_id="o3-mini", reasoning_effort="high")
self.coder_model = LiteLLMModel(
    model_id="openrouter/anthropic/claude-3.5-sonnet",
    temperature=0.8
)
self.robust_model = LiteLLMModel(model_id="o1")
```
### Specialized Agents

#### Research Agent
- Equipped with web search and webpage visit capabilities
- Uses high-capability model for complex reasoning
- Maximum 6 processing steps for thorough research
- Access to extensive data processing tools
```python
# google_search and visit_webpage are instantiated tools,
# e.g. GoogleSearchTool() and VisitWebpageTool() from smolagents.
self.researcher = CodeAgent(
    tools=[google_search, visit_webpage],
    model=self.coder_model,
    max_steps=6,
    verbosity_level=3,
    additional_authorized_imports=[
        'math', 'queue', 'stat', 'statistics', 're', 'itertools',
        'unicodedata', 'collections', 'datetime', 'time', 'random',
        'bs4', 'markdownify', 'requests', 'pandas'
    ]
)
```
```python
async def research_cultural_info(self, category: str, topic: str) -> Dict:
    try:
        research_prompt = f"""
        You are an expert researcher on African history.
        Research and provide comprehensive information about {topic} in African {category}.
        Focus on historical context, regional variations, and modern practices.
        """
        research_data = self.researcher.run(research_prompt)
        structure_prompt = f"""
        Based on this research: {research_data}
        Create a structured JSON with:
        {{
            "overview": "brief description",
            "historical_context": "historical background",
            "regional_variations": ["list of variations by region"],
            "cultural_significance": "detailed significance",
            "modern_practices": "current adaptations",
            "sources": ["list of sources"]
        }}
        """
        structured_data = await self.generate_with_model(structure_prompt)
        return json.loads(structured_data)
    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        return {}
```
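Model responses often wrap JSON in Markdown code fences, which makes a bare `json.loads` call on the raw response brittle. A small helper (hypothetical, not part of smolagents) can strip fences and fall back to the first JSON-looking span before parsing:

```python
import json
import re

def extract_json(text: str):
    """Parse JSON from a model response, tolerating Markdown code fences.

    Returns the parsed object, or None if no valid JSON is found.
    """
    # Strip ```json ... ``` or ``` ... ``` fences if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Try the candidate as-is, then any {...} or [...] span inside it.
    for chunk in (candidate, *re.findall(r"\{.*\}|\[.*\]", candidate, re.DOTALL)):
        try:
            return json.loads(chunk)
        except json.JSONDecodeError:
            continue
    return None
```

Swapping this in for the direct `json.loads(structured_data)` call would let the pipeline survive fenced or chatty responses instead of returning an empty record.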
#### QA Generator Agent
- Creates culturally-aware questions and answers
- Implements difficulty levels (basic/intermediate/advanced)
- Ensures regional representation
- Maintains cultural authenticity
```python
async def generate_qa_pairs(self, cultural_data: Dict) -> List[Dict]:
    try:
        qa_prompt = f"""
        Based on this cultural information:
        {json.dumps(cultural_data, indent=2)}
        Generate 6 question-answer pairs in this JSON format:
        [{{
            "question": "detailed question",
            "answer": "comprehensive answer",
            "difficulty": "basic|intermediate|advanced",
            "category": "historical|practical|conceptual",
            "regions": ["relevant African regions"]
        }}]
        """
        qa_response = await self.generate_with_model(qa_prompt)
        return json.loads(qa_response)
    except Exception as e:
        print(f"QA generation error: {e}")
        return []
```
#### Reasoning Generator Agent
- Produces detailed solution chains
- Breaks down cultural concepts
- Provides step-by-step analysis
- Links historical and modern contexts
```python
async def generate_reasoning(self, qa_pairs: List[Dict]) -> List[Dict]:
    try:
        reasoning_prompt = f"""
        For these Q&A pairs:
        {json.dumps(qa_pairs, indent=2)}
        Generate detailed reasoning chains in this JSON format:
        [{{
            "question": "original question",
            "reasoning_steps": [
                "step 1: initial understanding",
                "step 2: cultural context",
                "step 3: analysis",
                "step 4: conclusion"
            ],
            "final_answer": "detailed answer",
            "cultural_context": "relevant cultural background",
            "sources": ["reference sources"]
        }}]
        """
        reasoning_data = await self.generate_with_model(reasoning_prompt)
        return json.loads(reasoning_data)
    except Exception as e:
        print(f"Reasoning generation error: {e}")
        return []
```
## Data Collection Pipeline

### Cultural Research Phase

```json
{
  "overview": "brief description",
  "historical_context": "historical background",
  "regional_variations": ["variations by region"],
  "cultural_significance": "detailed significance",
  "modern_practices": "current adaptations",
  "sources": ["reference sources"]
}
```
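Because this record is produced by an LLM, it is worth checking it against the expected schema before it flows into the QA phase. A minimal validator sketch (a hypothetical helper; the field names are taken from the format above):

```python
# Expected fields of the research-phase output and their types.
RESEARCH_SCHEMA = {
    "overview": str,
    "historical_context": str,
    "regional_variations": list,
    "cultural_significance": str,
    "modern_practices": str,
    "sources": list,
}

def validate_research(data: dict) -> list:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, expected_type in RESEARCH_SCHEMA.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return problems
```

Records that fail validation can be retried or dropped instead of silently corrupting the downstream QA and reasoning phases.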
### QA Generation Phase

```json
{
  "question": "detailed question",
  "answer": "comprehensive answer",
  "difficulty": "basic|intermediate|advanced",
  "category": "historical|practical|conceptual",
  "regions": ["relevant African regions"]
}
```
### Reasoning Chain Generation

```json
{
  "question": "original question",
  "reasoning_steps": [
    "step 1: initial understanding",
    "step 2: cultural context",
    "step 3: analysis",
    "step 4: conclusion"
  ],
  "final_answer": "detailed answer",
  "cultural_context": "relevant background"
}
```
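If these records are later turned into a fine-tuning corpus, the reasoning steps usually need to be flattened into a single completion string. One possible serialization (a sketch; the field names match the format above, but the prompt/completion template itself is an assumption):

```python
def format_reasoning_example(record: dict) -> dict:
    """Flatten one reasoning record into a prompt/completion pair."""
    steps = "\n".join(record["reasoning_steps"])
    completion = (
        f"{steps}\n"
        f"Final answer: {record['final_answer']}\n"
        f"Cultural context: {record['cultural_context']}"
    )
    return {"prompt": record["question"], "completion": completion}
```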
## Full Code
```python
import os
import json
import asyncio
from datetime import datetime
from typing import Dict, List

from smolagents import CodeAgent, GoogleSearchTool, VisitWebpageTool, LiteLLMModel

# Instantiate the research tools once. GoogleSearchTool needs a search-provider
# API key (e.g. SERPAPI_API_KEY); DuckDuckGoSearchTool is a keyless alternative.
google_search = GoogleSearchTool()
visit_webpage = VisitWebpageTool()


class AfricanCultureDataGenerator:
    def __init__(self, api_key: str):
        # Initialize with explicit API key
        os.environ["OPENAI_API_KEY"] = api_key
        self.model = LiteLLMModel(
            model_id="gpt-4o-mini",
        )
        self.reasoning_model = LiteLLMModel(
            model_id="o3-mini",
            reasoning_effort="high",
        )
        self.coder_model = LiteLLMModel(
            model_id="openrouter/anthropic/claude-3.5-sonnet",
            api_key=os.environ["OPENROUTER_API_KEY"],
            temperature=0.8,
        )
        self.robust_model = LiteLLMModel(
            model_id="o1",
        )
        # Research Agent
        self.researcher = CodeAgent(
            tools=[google_search, visit_webpage],
            model=self.coder_model,
            max_steps=6,
            verbosity_level=3,
            additional_authorized_imports=[
                'math', 'queue', 'stat', 'statistics', 're', 'itertools',
                'unicodedata', 'collections', 'datetime', 'time', 'random',
                'bs4', 'markdownify', 'requests', 'pandas'
            ],
        )
        self.categories = {
            "traditions": [
                "marriage ceremonies",
                "naming ceremonies",
                "initiation rituals",
                "storytelling",
                "science"
            ],
            "music": [
                "traditional instruments",
                "musical styles",
                "dance forms",
                "ceremonial music"
            ],
            "social_structures": [
                "family systems",
                "leadership roles",
                "age groups",
                "community organization"
            ],
            "cultural_values": [
                "respect for elders",
                "community solidarity",
                "spiritual beliefs",
                "oral traditions"
            ]
        }

    async def generate(self, prompt: str) -> str:
        agent = CodeAgent(
            tools=[],
            model=self.model,
            max_steps=6,
            additional_authorized_imports=[
                'bs4', 'stat', 'statistics', 'unicodedata', 'collections',
                'requests', 'time', 'json', 'os', 'random', 'math', 'queue',
                'markdownify', 're', 'itertools', 'datetime', 'pandas'
            ]
        )
        # Get the agent's response (agent.run is synchronous).
        response = agent.run(prompt)
        # If the response is a dictionary, convert it to a JSON string.
        if isinstance(response, dict):
            return json.dumps(response)
        # Otherwise, return the response as is.
        return response

    async def generate_with_model(self, prompt: str) -> str:
        try:
            response = await self.generate(prompt)
            return response if response else "{}"
        except Exception as e:
            print(f"Model generation error: {e}")
            return "{}"

    async def research_cultural_info(self, category: str, topic: str) -> Dict:
        try:
            research_prompt = f"""
            You are an expert researcher on African history.
            Research and provide comprehensive information about {topic} in African {category}.
            Focus on historical context, regional variations, and modern practices.
            """
            research_data = self.researcher.run(research_prompt)
            structure_prompt = f"""
            Based on this research: {research_data}
            Create a structured JSON with:
            {{
                "overview": "brief description",
                "historical_context": "historical background",
                "regional_variations": ["list of variations by region"],
                "cultural_significance": "detailed significance",
                "modern_practices": "current adaptations",
                "sources": ["list of sources"]
            }}
            """
            structured_data = await self.generate_with_model(structure_prompt)
            return json.loads(structured_data)
        except json.JSONDecodeError as e:
            print(f"JSON parsing error: {e}")
            return {}

    async def generate_qa_pairs(self, cultural_data: Dict) -> List[Dict]:
        try:
            qa_prompt = f"""
            Based on this cultural information:
            {json.dumps(cultural_data, indent=2)}
            Generate 6 question-answer pairs in this JSON format:
            [{{
                "question": "detailed question",
                "answer": "comprehensive answer",
                "difficulty": "basic|intermediate|advanced",
                "category": "historical|practical|conceptual",
                "regions": ["relevant African regions"]
            }}]
            """
            qa_response = await self.generate_with_model(qa_prompt)
            return json.loads(qa_response)
        except Exception as e:
            print(f"QA generation error: {e}")
            return []

    async def generate_reasoning(self, qa_pairs: List[Dict]) -> List[Dict]:
        try:
            reasoning_prompt = f"""
            For these Q&A pairs:
            {json.dumps(qa_pairs, indent=2)}
            Generate detailed reasoning chains in this JSON format:
            [{{
                "question": "original question",
                "reasoning_steps": [
                    "step 1: initial understanding",
                    "step 2: cultural context",
                    "step 3: analysis",
                    "step 4: conclusion"
                ],
                "final_answer": "detailed answer",
                "cultural_context": "relevant cultural background",
                "sources": ["reference sources"]
            }}]
            """
            reasoning_data = await self.generate_with_model(reasoning_prompt)
            return json.loads(reasoning_data)
        except Exception as e:
            print(f"Reasoning generation error: {e}")
            return []

    async def process_category(self, category: str, topic: str) -> Dict:
        try:
            cultural_data = await self.research_cultural_info(category, topic)
            qa_pairs = await self.generate_qa_pairs(cultural_data)
            reasoning_data = await self.generate_reasoning(qa_pairs)
            return {
                "category": category,
                "topic": topic,
                "cultural_data": cultural_data,
                "qa_pairs": qa_pairs,
                "reasoning_data": reasoning_data,
                "metadata": {
                    "generated_at": datetime.now().isoformat(),
                    "model": "gpt-family/o3",
                    "version": "1.0"
                }
            }
        except Exception as e:
            print(f"Error processing {category}/{topic}: {e}")
            return {"error": str(e)}

    async def generate_dataset(self):
        dataset = {}
        for category, topics in self.categories.items():
            dataset[category] = {}
            for topic in topics:
                print(f"Processing {category}/{topic}...")
                dataset[category][topic] = await self.process_category(category, topic)
                await asyncio.sleep(2)  # light rate limiting between topics
        with open("african_cultural_dataset.json", "w", encoding="utf-8") as f:
            json.dump(dataset, f, indent=2, ensure_ascii=False)
        return dataset


async def main():
    api_key = os.environ["OPENAI_API_KEY"]
    generator = AfricanCultureDataGenerator(api_key)
    dataset = await generator.generate_dataset()
    print("Dataset generation complete!")


if __name__ == "__main__":
    # Top-level `await` only works in notebooks; in a script, use asyncio.run.
    asyncio.run(main())
```
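The nested `category → topic → record` structure written above can be flattened into one row per question before uploading to the Hub (for example with `datasets.Dataset.from_list` followed by `push_to_hub`). A sketch of the flattening step, assuming the field names used throughout this post:

```python
def flatten_qa_rows(dataset: dict) -> list:
    """Turn the nested category/topic structure into one row per QA pair."""
    rows = []
    for category, topics in dataset.items():
        for topic, record in topics.items():
            # Records that errored out have no "qa_pairs" key; skip them.
            for pair in record.get("qa_pairs", []):
                rows.append({
                    "category": category,
                    "topic": topic,
                    "question": pair.get("question", ""),
                    "answer": pair.get("answer", ""),
                    "difficulty": pair.get("difficulty", ""),
                    "regions": pair.get("regions", []),
                })
    return rows

# The resulting list of flat dicts can then be loaded with
# datasets.Dataset.from_list(rows) and pushed with push_to_hub(...).
```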
## Conclusion
This implementation demonstrates the power of specialized AI agents in creating rich African cultural datasets, leveraging a multi-agent architecture for research, QA generation, and reasoning chains. While the current implementation shows promise, transitioning to an E2B code executor with an Orchestrator would offer several advantages:
- Better execution control and resource management
- Improved error handling and API key management
- Parallel processing of cultural data collection
- Scalable infrastructure for larger datasets
- Enhanced monitoring and validation capabilities
The next phase should focus on:
- Implementing an Orchestrator to manage agent workflows
- Utilizing E2B's code execution environment for reliable processing
- Adding robust validation mechanisms for cultural accuracy
- Implementing parallel data collection across regions
- Enhancing the reasoning chain generation with distributed processing
This evolution would maintain the current system's cultural authenticity while adding enterprise-grade reliability and scalability.
## Resources
- African Cultural Dataset: https://huggingface.co/datasets/Svngoku/african_cultural_data
- African Cultural QA Dataset: https://huggingface.co/datasets/Svngoku/african_cultural_qa_pairs
- African Cultural Reasoning Dataset: https://huggingface.co/datasets/Svngoku/african_cultural_reasoning