Celso F
celsowm
AI & ML interests
None yet
Recent Activity
New activity 21 days ago in deepseek-ai/DeepSeek-V3-Base: Resource Requirements for Running DeepSeek v3 Locally
Reacted to nroggendorff's post with 🤗 about 1 month ago
hey nvidia, can you send me a gpu?
comment or react if you want ~~me~~ to get one too.
Reacted to nroggendorff's post with ➕ about 1 month ago
hey nvidia, can you send me a gpu?
comment or react if you want ~~me~~ to get one too.
Organizations
None yet
celsowm's activity
Resource Requirements for Running DeepSeek v3 Locally
5
#56 opened about 1 month ago by wilfoderek
Reacted to nroggendorff's post with 🤗 ➕ about 1 month ago
Upvoted a paper 3 months ago
Reacted to singhsidhukuldeep's post with 🔥 4 months ago
Post
1854
Good folks at @nvidia have released exciting new research on normalized Transformers (nGPT) for faster and more efficient language modeling!
Here is what they are proposing:
1. Remove all normalization layers, like RMSNorm or LayerNorm, from the standard Transformer architecture.
2. Normalize all matrices along their embedding dimension after each training step. This includes input and output embeddings, attention matrices (Q, K, V), output projection matrices, and MLP matrices.
3. Replace the standard residual connections with normalized update equations using learnable eigen learning rates for the attention and MLP blocks.
4. Change the softmax scaling factor in the attention mechanism from 1/sqrt(d_k) to sqrt(d_k).
5. Implement rescaling and optional normalization of query (q) and key (k) vectors in the attention mechanism using learnable scaling factors.
6. Rescale the intermediate states of the MLP block using learnable scaling factors.
7. Implement rescaling of the output logits using learnable scaling factors.
8. Remove weight decay and learning rate warmup from the optimization process.
9. Initialize the eigen learning rates and scaling factors with appropriate values as specified in the paper.
10. During training, treat all vectors and matrices as residing on a unit hypersphere, interpreting matrix-vector multiplications as cosine similarities.
11. Implement the update equations for the hidden states using the normalized outputs from attention and MLP blocks, controlled by the eigen learning rates.
12. After each forward pass, normalize all parameter matrices to ensure they remain on the unit hypersphere.
13. Use the Adam optimizer without weight decay for training the model.
14. When computing loss, apply the learnable scaling factor to the logits before the softmax operation.
15. During inference, follow the same normalization and scaling procedures as in training.
Excited to see how it scales to larger models and datasets!
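A minimal PyTorch sketch of the core recipe above, i.e. the hypersphere normalization and the eigen-learning-rate residual updates. The module layout, the names `l2norm`, `NGPTBlock`, and `renormalize_weights`, and the 0.05 initialization are illustrative assumptions, not the paper's reference implementation:

```python
# Illustrative sketch only -- names and initial values are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def l2norm(x, dim=-1):
    # Keep vectors on the unit hypersphere along the embedding dimension.
    return F.normalize(x, p=2, dim=dim)

class NGPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        # No LayerNorm / RMSNorm anywhere (step 1).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        # Learnable "eigen learning rates" for the normalized updates (steps 3 and 9).
        self.alpha_attn = nn.Parameter(torch.full((d_model,), 0.05))
        self.alpha_mlp = nn.Parameter(torch.full((d_model,), 0.05))

    def forward(self, h):
        # Normalized update equations replace plain residual connections (steps 3 and 11).
        h_attn = l2norm(self.attn(h, h, h, need_weights=False)[0])
        h = l2norm(h + self.alpha_attn * (h_attn - h))
        h_mlp = l2norm(self.mlp(h))
        h = l2norm(h + self.alpha_mlp * (h_mlp - h))
        return h

@torch.no_grad()
def renormalize_weights(model):
    # After each training step, re-normalize parameter matrices along the
    # embedding dimension so they stay on the unit hypersphere (steps 2 and 12).
    for p in model.parameters():
        if p.ndim == 2:
            p.copy_(l2norm(p, dim=-1))
```

In use, `renormalize_weights(model)` would be called after each `optimizer.step()` (step 12); the learnable q/k, MLP, and logit scaling factors (steps 5 to 7 and 14) are omitted here for brevity.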
space or lmarena to test it online
1
#13 opened 4 months ago by celsowm
11b instruct gguf?
3
#1 opened 5 months ago by celsowm
Reacted to bartowski's post with ❤️ 5 months ago
Post
34982
Reposting from twitter:
Just so you all know, I'll be on vacation for the following two weeks and away from home! I'm hoping to get on at least once a day to load up some quants, but I won't be as bleeding edge and on the ball :) feel free to shoot me a message if you see one I should make!
In the meantime if you need something bleeding edge make sure to check out @MaziyarPanahi or @bullerwins who both put out great work!
Promising looking results on 24GB VRAM folks!
9
#3 opened 5 months ago by ubergarm
AttributeError: 'AutoencoderKLCogVideoX' object has no attribute 'enable_tiling'
2
#11 opened 6 months ago by celsowm
Your feedback on HuggingChat
281
#1 opened almost 2 years ago by victor
GGUF quants version?
1
#1 opened 6 months ago by celsowm