---
base_model:
- fblgit/cybertron-v4-qw7B-MGS
- huihui-ai/Qwen2.5-7B-Instruct-abliterated-v3
- FreedomIntelligence/HuatuoGPT-o1-7B
- rombodawg/Rombos-LLM-V2.5-Qwen-7b
- Qwen/Qwen2.5-7B-Instruct
library_name: transformers
tags:
- mergekit
- merge
- funny
- conversational
- text-generation
- chihuahua-powerful
- boolean-expression-champion
- math-avoider
- object-counting-struggler
---

Xiaojian9992024/Qwen2.5-THREADRIPPER-Small - The "Small" is Just for Show (and Benchmarks)
Model Description:
Behold, the Qwen2.5-THREADRIPPER-Small! Don't let the "Small" in the name fool you; this model is compactly powerful... in the same way a chihuahua is "compactly powerful." We merged a bunch of models together using some fancy algorithm called "Linear DELLA" because, frankly, we thought it sounded cool. Did it work? Well...
Think of this model as the Frankenstein's monster of language models, but instead of being scary, it's just... kinda there. It's built upon the mighty Qwen2.5-7B-Instruct, and then we threw in a cybernetic one, an "abliterated" one (we're not sure what that means either), one that's good at medical stuff (maybe it can diagnose your code?), and another one named "Rombos" because why not?
Merge Details
Merge Method
This model was merged with mergekit's Linear DELLA (della_linear) method, using Qwen/Qwen2.5-7B-Instruct as the base, because "Linear DELLA" sounded like it knew what it was doing. (Spoiler: the jury's still out.)
Models Merged
We lovingly stitched together the following models to create this... unique entity:
- fblgit/cybertron-v4-qw7B-MGS - For that cybernetic edge (we hope).
- huihui-ai/Qwen2.5-7B-Instruct-abliterated-v3 - "Abliterated" - sounds intense, right? We're banking on it.
- FreedomIntelligence/HuatuoGPT-o1-7B - Maybe it can give your code a check-up? (Probably not).
- rombodawg/Rombos-LLM-V2.5-Qwen-7b - Because every good merge needs a "Rombos".
- Qwen/Qwen2.5-7B-Instruct - Our solid, dependable base. Relatively speaking.
Benchmarks - Prepare for a Wild Ride (Mostly Downhill):
Okay, let's talk numbers. We ran this bad boy on the Open LLM Leaderboard, and the results are... well, they're numbers! Don't stare directly at them for too long.
- MMLU "Pro": Accuracy: 43.5%. Yes, you read that right. It's about as accurate as guessing, but hey, at least it's consistently around 43%! We're aiming for consistency here, folks. Participation trophies for everyone!
- BBH - Boolean Expressions: Accuracy (Normalized): 83.6%. BOOM! Boolean expressions? Nailed it! Ask it if "true and false or true" is true, and it'll get it right most of the time (the correct answer, for the record, is worked out in the snippet after this list). Boolean logic? Bring it on! Existential questions? Maybe not.
- BBH - Object Counting: Accuracy (Normalized): 33.6%. Counting objects? Apparently, this is where our Threadripper trips over its own feet. Maybe it needs glasses? Or perhaps objects are just inherently confusing, even for advanced AI. We blame the objects.
- BBH - Tracking Shuffled Objects (7 objects): Accuracy (Normalized): 14.4%. Seven objects? Forget about it. Three objects? Still bad (22.4%). Five objects? Slightly less terrible (22%). If you need something tracked, maybe use GPS, not this model. Unless you're tracking Boolean values. Then we're golden.
- GPQA: Accuracy (Normalized): ~30%. GPQA? More like GP-"Q"-Maybe-"A". It's trying its best, okay? Lower your expectations.
- Math Hard (Algebra, Counting, Geometry, etc.): Exact Match: 0.0%. Zero. Zilch. Nada. If you need help with your math homework, please, for the love of numbers, use a calculator. Or ask a human. Or a very sophisticated pigeon. Anything but this for math. Seriously, anything.
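For reference, and to feel superior to a 7B-parameter model: the Boolean example above really does come out True, because `and` binds tighter than `or`. A two-line sanity check in plain Python:

```python
# The BBH example from the list above, evaluated the boring way.
# Python's `and` binds tighter than `or`, so this parses as (True and False) or True.
answer = True and False or True
print(answer)  # True
```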
Intended Use:
- Conversational? Sure, if you like conversations that are 43.5% accurate on general knowledge and amazing at Boolean expressions but can't count or track objects. It's definitely a conversation starter... about the limitations of language models.
- Text Generation? Absolutely! It generates text. Whether that text is coherent, accurate, or helpful is another question entirely. But it does generate text, and sometimes that's the best you can ask for. Think of it as performance art.
- Funny Model Cards? Clearly, yes. It excels at providing benchmark data that is hilarious when you try to spin it positively. We're leaning into our strengths here.
Limitations:
- Math. Just... math. Avoid math. Run away from math. If you even think about math, this model will give you a blank stare and possibly start reciting Boolean expressions for comfort.
- Object counting and tracking. Objects are its nemesis. Especially when shuffled. Or when there are more than two. Actually, just avoid objects in general. Stick to abstract concepts.
- GPQA. We're still not sure what GPQA is, and neither is the model, apparently. It's a mystery for the ages.
- May occasionally hallucinate benchmark scores that are slightly better than reality (we're working on our honesty module... or maybe not).
How to Use:
Use responsibly? Or irresponsibly, we're not your boss. Just don't expect it to balance your checkbook or track your keys. For Boolean expressions though? It's your champion. Need to know if "cat is animal AND animal has fur"? This model's got you.
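If you insist on actually running it, here is a minimal sketch using the standard transformers chat workflow. The repo id is taken from the title of this card; the dtype/device settings and the prompt are just placeholders (and `device_map="auto"` assumes you have accelerate installed):

```python
# Minimal generation sketch with the standard transformers chat workflow.
# Repo id comes from the card title; everything else is stock Qwen2.5-style usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Xiaojian9992024/Qwen2.5-THREADRIPPER-Small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map="auto" needs accelerate
)

messages = [
    {"role": "user", "content": 'Is "true and false or true" true or false?'}  # play to its strengths
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```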
Disclaimer:
Side effects of using this model may include: existential dread, questioning the nature of intelligence, and a sudden urge to count shuffled objects yourself to prove you're better than a language model. Use at your own risk. But hey, at least it's small! And sometimes, small and funny is all you need.
Configuration
The following YAML configuration was used to produce this model (because you asked, not because we understand it):
```yaml
merge_method: della_linear
base_model: Qwen/Qwen2.5-7B-Instruct
dtype: bfloat16
parameters:
  epsilon: 0.015 # Fine-grain scaling for precision. (Sounds important!)
  lambda: 1.6 # Strong emphasis on top-performing models. (We aimed high!)
  normalize: true # Stable parameter integration across models. (Stability is key, even if accuracy isn't)
adaptive_merge_parameters:
  task_weights:
    tinyArc: 1.75 # Logical reasoning. (For when you need *some* logic)
    tinyHellaswag: 1.65 # Contextual predictions. (It tries, bless its heart)
    tinyMMLU: 1.8 # Domain knowledge. (Limited domains, mostly Boolean expressions)
    tinyTruthfulQA: 2.0 # Prioritize truthful reasoning. (Truthful-ish. Mostly.)
    tinyTruthfulQA_mc1: 1.85 # Even more truthful-ishness!
    tinyWinogrande: 1.9 # Advanced reasoning and predictions. (Baby steps in advanced reasoning)
    IFEval: 2.1 # Instruction-following and multitasking. (Instructions followed loosely. Multitasking? Define "task".)
    BBH: 2.25 # Complex reasoning. (Boolean expressions are complex, right?)
    MATH: 2.4 # Mathematical reasoning. (Just kidding. See benchmarks.)
    GPQA: 2.35 # Factual QA. (Facts are... subjective?)
    MUSR: 2.3 # Multi-step reasoning. (One step at a time, maybe?)
    MMLU-PRO: 2.35 # Domain multitask performance. (Boolean expressions. We keep coming back to Boolean expressions.)
  smoothing_factor: 0.05 # TURN UP THE SMOOTH! (Smoothness is next to godliness, or at least accuracy)
models:
  - model: Qwen/Qwen2.5-7B-Instruct
    parameters:
      weight: 0.65 # The heavy lifter (relatively speaking)
      density: 0.65 # Dense with... something.
  - model: huihui-ai/Qwen2.5-7B-Instruct-abliterated-v3
    parameters:
      weight: 0.1 # A touch of "abliteration" for flavor
      density: 0.1 # Just a sprinkle
  - model: rombodawg/Rombos-LLM-V2.5-Qwen-7b
    parameters:
      weight: 0.15 # Rombos-ness, now in model form!
      density: 0.15 # A dash more density
  - model: fblgit/cybertron-v4-qw7B-MGS
    parameters:
      weight: 0.05 # Cybertron! Pew pew! (Performance may vary)
      density: 0.05 # A smidgen of cybernetics
  - model: FreedomIntelligence/HuatuoGPT-o1-7B
    parameters:
      weight: 0.05 # Medical intelligence? Maybe?
      density: 0.05 # Homeopathic dose of medical knowledge
```
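If you want to recreate this masterpiece, the config above is written for mergekit. A rough sketch of running it via mergekit's `mergekit-yaml` command, assuming mergekit is installed and the YAML is saved as `threadripper.yaml` (the filename and output directory are placeholders we invented, and whether mergekit accepts every key in that config is between you and mergekit):

```python
# Sketch: invoke mergekit's mergekit-yaml CLI from Python to rebuild the merge.
# Assumes `pip install mergekit` and that the YAML above is saved as threadripper.yaml.
import subprocess

subprocess.run(
    ["mergekit-yaml", "threadripper.yaml", "./Qwen2.5-THREADRIPPER-Small"],
    check=True,  # raise if the merge fails, which is more accountability than our benchmarks got
)
```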