While deploying DeepSeek-R1-Distill-Qwen-32B with LLaMA-Factory, I noticed that the model's output sometimes does not start with <think>, yet it still emits </think>. https://github.com/deepseek-ai/DeepSeek-R1/issues/352

Cause analysis:

The problem lies in the chat_template in https://hf-mirror.com/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/main/tokenizer_config.json, specifically in the generation prompt at the end of the template.
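The tail of the template looks roughly like this (abridged and paraphrased rather than quoted verbatim; in the JSON file the string is escaped as "<think>\\n", so check the file itself for the exact wording):

{% if add_generation_prompt %}{{'<｜Assistant｜><think>\n'}}{% endif %}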

The template is written this way to guarantee that generation starts with "<think>\n" and then proceeds through the reasoning process, preventing the model from skipping the thinking step and emitting the answer directly when its output does not begin with "<think>\n".

The reason you do not see <think> in the output is that this <think> now belongs to the prompt, not to the generated part.

Solutions:

Method 1:

Edit the chat_template in https://hf-mirror.com/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/main/tokenizer_config.json directly and delete "<think>\\n". However, this may cause the model to skip the thinking process entirely. The edit is sketched below.
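A before/after sketch of that branch of the template (paraphrased, not a verbatim diff of the file):

{# before: the template forces the thinking prefix #}
{% if add_generation_prompt %}{{'<｜Assistant｜><think>\n'}}{% endif %}

{# after: the forced prefix is deleted #}
{% if add_generation_prompt %}{{'<｜Assistant｜>'}}{% endif %}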

Note that the official DeepSeek-R1 repository also recommends forcing the model to begin its output with "<think>\n".

Method 2:

I am using the LLaMA-Factory repository with the template set to qwen (which, unlike the current deepseek3 template, does not prepend <think>), deployed with vLLM. You can modify the code directly so that, at generation time, the first token is forced to be <think>.
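For reference, the deployment might look roughly like this (illustrative; the key names follow LLaMA-Factory's inference example configs, so verify them against your version):

# api_r1_distill.yaml (hypothetical file name)
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
template: qwen
infer_backend: vllm

launched with: llamafactory-cli api api_r1_distill.yaml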

Modify the _generate function in vllm_engine.py:

The main idea: after the conversation has been rendered into prompt_ids by the chat template, append the vocabulary ids of "<think>\n" to the end of prompt_ids, then let vLLM continue generation from that prompt.

After the change, the function looks like this:

async def _generate(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[Sequence["ImageInput"]] = None,
        videos: Optional[Sequence["VideoInput"]] = None,
        **input_kwargs,
    ) -> AsyncIterator["RequestOutput"]:
        request_id = f"chatcmpl-{uuid.uuid4().hex}"
        if images is not None:
            if not any(IMAGE_PLACEHOLDER in message["content"] for message in messages):
                messages[0]["content"] = IMAGE_PLACEHOLDER * len(images) + messages[0]["content"]

        paired_messages = messages + [{"role": "assistant", "content": ""}]
        system = system or self.generating_args["default_system"]
        # --- modification: force the generation to start with "<think>\n" ---
        prefix_token = "<think>\n"
        prompt_ids, _ = self.template.encode_oneturn(self.tokenizer, paired_messages, system, tools)
        if prefix_token:
            # append the token ids of "<think>\n" to the prompt, so vLLM
            # continues the generation from inside the thinking block
            prefix_token_ids = self.tokenizer(prefix_token, add_special_tokens=False)["input_ids"]
            prompt_ids.extend(prefix_token_ids)
        # print(self.tokenizer.decode(prompt_ids))  # sanity check: the prompt now ends with "<think>\n"
        prompt_length = len(prompt_ids)

        use_beam_search: bool = self.generating_args["num_beams"] > 1
        temperature: Optional[float] = input_kwargs.pop("temperature", None)
        top_p: Optional[float] = input_kwargs.pop("top_p", None)
        top_k: Optional[float] = input_kwargs.pop("top_k", None)
        num_return_sequences: int = input_kwargs.pop("num_return_sequences", 1)
        repetition_penalty: Optional[float] = input_kwargs.pop("repetition_penalty", None)
        length_penalty: Optional[float] = input_kwargs.pop("length_penalty", None)
        max_length: Optional[int] = input_kwargs.pop("max_length", None)
        max_new_tokens: Optional[int] = input_kwargs.pop("max_new_tokens", None)
        stop: Optional[Union[str, List[str]]] = input_kwargs.pop("stop", None)

        if "max_new_tokens" in self.generating_args:
            max_tokens = self.generating_args["max_new_tokens"]
        elif "max_length" in self.generating_args:
            if self.generating_args["max_length"] > prompt_length:
                max_tokens = self.generating_args["max_length"] - prompt_length
            else:
                max_tokens = 1

        if max_length:
            max_tokens = max_length - prompt_length if max_length > prompt_length else 1

        if max_new_tokens:
            max_tokens = max_new_tokens

        sampling_params = SamplingParams(
            n=num_return_sequences,
            repetition_penalty=(
                repetition_penalty if repetition_penalty is not None else self.generating_args["repetition_penalty"]
            )
            or 1.0,  # repetition_penalty must > 0
            temperature=temperature if temperature is not None else self.generating_args["temperature"],
            top_p=(top_p if top_p is not None else self.generating_args["top_p"]) or 1.0,  # top_p must > 0
            top_k=top_k if top_k is not None else self.generating_args["top_k"],
            use_beam_search=use_beam_search,
            length_penalty=length_penalty if length_penalty is not None else self.generating_args["length_penalty"],
            stop=stop,
            stop_token_ids=[self.tokenizer.eos_token_id] + self.tokenizer.additional_special_tokens_ids,
            max_tokens=max_tokens,
            skip_special_tokens=True,
        )

        if images is not None:  # add image features
            image_data = []
            for image in images:
                if not isinstance(image, (str, ImageObject)):
                    raise ValueError(f"Expected image input is a path or PIL.Image, but got {type(image)}.")

                if isinstance(image, str):
                    image = Image.open(image).convert("RGB")

                image_data.append(image)

            multi_modal_data = {"image": image_data}
        else:
            multi_modal_data = None

        result_generator = self.model.generate(
            inputs={"prompt_token_ids": prompt_ids, "multi_modal_data": multi_modal_data},
            sampling_params=sampling_params,
            request_id=request_id,
            lora_request=self.lora_request,
        )
        return result_generator

Now, when asking "1+1=?", printing vLLM's input shows that it already contains <think>.
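With the qwen template, the decoded prompt ends roughly like this (illustrative, reconstructed from the qwen chat format rather than copied from an actual log):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
1+1=?<|im_end|>
<|im_start|>assistant
<think>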

At the same time, remember to prepend "<think>\n" when constructing the API response, otherwise it will not show up in the output.

This needs to happen in both the non-streaming and the streaming output paths; a sketch of both follows.
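A minimal sketch of the idea, with hypothetical helper names (not LLaMA-Factory's actual API; the real change goes wherever your server assembles the ChatCompletion response):

from typing import AsyncIterator

PREFIX = "<think>\n"  # the prefix now lives in the prompt, so vLLM's output lacks it

def wrap_full_response(generated_text: str) -> str:
    """Non-streaming: prepend the forced prefix to the complete generation."""
    return PREFIX + generated_text

async def wrap_stream(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Streaming: emit the prefix as the very first chunk, then pass the rest through."""
    yield PREFIX
    async for chunk in chunks:
        yield chunk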

Redeploy after the code changes.

Then test the API:

import requests

if __name__ == "__main__":
    api_url = "http://xxxx/v1/chat/completions"  # DeepSeek-R1-Distill-Qwen-32B endpoint
    data = {
        "model": "test",
        "messages": [
            {
                # "role": "user", "content": "There were 4 birds in a tree; one was shot. How many are left?"
                "role": "user", "content": "1+1=?"
            },
        ],
        "tools": [],
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0,  # falls back to 1.0 on the server side, since top_p must be > 0
        "n": 1,
        "max_tokens": 512,
        "stream": False,
    }
    response = requests.post(url=api_url, json=data)  # json= also sets the Content-Type header
    print(response.json()["choices"][0]["message"]["content"])

The output now has that unmistakable DeepSeek flavor.

<think>
Okay, so I have this problem here: 1 + 1 equals what? Hmm, let's see. I remember from when I was a kid, my teacher showed us how to add numbers. So, if I have one apple and then I get another apple, how many apples do I have in total? That makes sense. It should be two apples, right? So, 1 plus 1 should be 2. But wait, is there a different way to look at this? Maybe in some other context, like in computer science or something? Oh, right, in binary, 1 + 1 equals 10. But I think the question is asking for basic arithmetic, not binary. So, sticking with the simple addition, 1 plus 1 is 2. Yeah, that seems right. I don't think I'm missing anything here. It's pretty straightforward.
</think>

1 + 1 equals 2.

Note: the latest version of LLaMA-Factory already ships a template for DeepSeek-R1-Distill-Qwen-32B; set the template to deepseek3. However, it does not support forced thinking.

Test results with the latest LLaMA-Factory are as follows:

It always outputs <think> and </think>, but the content between them may be empty; in other words, the model did not actually think, and the tags were simply patched on by the hand-written template.

<think>

</think>

1 + 1 equals 2.