Installing Akegarasu's (秋叶) lora-scripts trainer on Linux

Author: 克罗西丁
Published: September 25, 2024
Category: Linux

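LoRA-scripts (SD-Trainer)

LoRA & Dreambooth training GUI, script presets, and a one-click training environment

Project page: GitHub - Akegarasu/lora-scripts (https://github.com/Akegarasu/lora-scripts)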

1. Prerequisites:

Python 3.10
Git
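
Before cloning, it can help to confirm both prerequisites are in place. The following is a minimal sketch, not part of lora-scripts; the helper name check_prerequisites is made up for illustration, and it only checks the running interpreter and whether git is on PATH.

```python
import shutil
import sys


def check_prerequisites(required=(3, 10)):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # Compare only major.minor, since the guide asks for Python 3.10.x.
    if tuple(sys.version_info[:2]) != tuple(required):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {required[0]}.{required[1]} expected, found {found}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("git") is None:
        problems.append("git not found on PATH")
    return problems


if __name__ == "__main__":
    for line in check_prerequisites() or ["Python and Git look fine"]:
        print(line)
```

Run it with the interpreter you intend to use for training, since a system Python and a Conda Python can differ in version.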

2. Get lora-scripts

Clone the repository together with its submodules

git clone --recurse-submodules https://github.com/Akegarasu/lora-scripts

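3. Create the Conda environment and dependencies

Install dependencies

If you use Conda like I do, follow this section. Otherwise skip it and go straight to section 4, Run the install script.

Create the Conda virtual environment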

conda create -n lora python=3.10
conda activate lora

Edit install.bash in the repository root and change create_venv=true to create_venv=false:

#!/usr/bin/bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
create_venv=false
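
If you would rather script that edit than open an editor, here is a minimal sketch. It assumes the assignment sits on its own line exactly as create_venv=true; the helper name disable_venv is hypothetical, and the sample text is only a shortened stand-in for install.bash.

```python
import re


def disable_venv(script_text):
    """Return install.bash text with create_venv=true switched to false."""
    # Match only a line that starts with the assignment, so a commented-out
    # copy of the setting is left untouched. [ \t]* tolerates trailing blanks.
    return re.sub(r"^create_venv=true[ \t]*$", "create_venv=false",
                  script_text, flags=re.MULTILINE)


sample = "#!/usr/bin/bash\n\ncreate_venv=true\n"
print(disable_venv(sample))
```

To apply it for real, read install.bash, pass its contents through disable_venv, and write the result back; keep a copy of the original first.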

4. Run the install script

Make install.bash executable

sudo chmod 755 install.bash

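Run the install script. It installs the required dependencies and sets up the virtual environment automatically.

If you completed section 3, Create the Conda environment and dependencies, the script uses the Python from your Conda environment instead.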

./install.bash

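5. Configure and run

Running gui.py starts the trainer and automatically downloads some additional dependencies. With the options below, the web UI is served at localhost:29080 and TensorBoard at localhost:29070.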

python gui.py --listen --port 29080 --tensorboard-port 29070 --tensorboard-host 0.0.0.0

Program arguments

Argument	Type	Default	Description
`--host`	str	"127.0.0.1"	Host name of the server
`--port`	int	28000	Port the server runs on
`--listen`	bool	false	Enable listen mode for the server
`--skip-prepare-environment`	bool	false	Skip the environment preparation step
`--disable-tensorboard`	bool	false	Disable TensorBoard
`--disable-tageditor`	bool	false	Disable the tag editor
`--tensorboard-host`	str	"127.0.0.1"	Host TensorBoard runs on
`--tensorboard-port`	int	6006	Port TensorBoard runs on
`--localization`	str		Localization setting for the UI
`--dev`	bool	false	Developer mode, used to disable some checks
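
To make the table concrete, the sketch below rebuilds it as an argparse parser and feeds it the launch command used in this guide. It is an illustration written from the table only, not code copied from the project's gui.py, and the bool flags are assumed to be simple on/off switches.

```python
import argparse


def build_parser():
    """Parser mirroring the table above; NOT the project's actual source."""
    p = argparse.ArgumentParser()
    p.add_argument("--host", type=str, default="127.0.0.1")
    p.add_argument("--port", type=int, default=28000)
    p.add_argument("--listen", action="store_true")
    p.add_argument("--skip-prepare-environment", action="store_true")
    p.add_argument("--disable-tensorboard", action="store_true")
    p.add_argument("--disable-tageditor", action="store_true")
    p.add_argument("--tensorboard-host", type=str, default="127.0.0.1")
    p.add_argument("--tensorboard-port", type=int, default=6006)
    p.add_argument("--localization", type=str)
    p.add_argument("--dev", action="store_true")
    return p


# The same options as the launch command in this guide.
args = build_parser().parse_args(
    "--listen --port 29080 --tensorboard-port 29070 --tensorboard-host 0.0.0.0".split()
)
print(args.port, args.tensorboard_port, args.tensorboard_host, args.listen)
# prints: 29080 29070 0.0.0.0 True
```

Any option left off the command line keeps the default listed in the table, which is why the ports 28000 and 6006 apply when gui.py is started with no arguments.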

Write a launch script

First create a launch script (once saved, it can be run with bash start.sh)

nano start.sh

Put the following in it:

#!/bin/bash
# Activate the Conda environment
eval "$(conda shell.bash hook)"
conda activate lora
#export CUDA_VISIBLE_DEVICES=(with multiple GPUs, selects which cards to use)
#export TOKENIZERS_PARALLELISM=false
#export NCCL_SHM_DISABLE=1
python gui.py --listen --port 29080 --tensorboard-port 29070 --tensorboard-host 0.0.0.0

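Other: fixes for problems I ran into

NCCL shared memory error

My environment runs inside Docker, where the shared memory at /dev/shm is only 64 MB, so training failed with: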

[rank3]: torch.distributed.DistBackendError: NCCL error in: 
../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.  
[rank3]: Last error:
[rank3]: Error while creating shared memory segment /dev/shm/nccl-Bvx3cU (size 9637888)

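1. Try enlarging the shared memory

On a regular host you can raise the size of /dev/shm by editing its entry in /etc/fstab. Inside Docker the size is fixed when the container is started, for example with the --shm-size option of docker run.

2. Temporarily disable shared memory use

If you cannot change the Docker container's startup parameters, you can still tell NCCL not to use shared memory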

export NCCL_SHM_DISABLE=1

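3. Free up shared memory

If you cannot enlarge /dev/shm, you may need to stop some processes to free more shared memory.
Use the following command to see which files currently occupy /dev/shm. It only lists them, so stale files still have to be removed by hand.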

ls -lh /dev/shm
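
To see how much room /dev/shm actually has, a small check like the one below may help. It is a sketch using only the standard library; the helper name shm_report is made up, and the 9637888-byte figure is simply the segment size from the error message above. Fitting one segment does not guarantee training will run, since several ranks each allocate their own.

```python
import os
import shutil

NCCL_SEGMENT_BYTES = 9637888  # segment size taken from the error log above


def shm_report(path="/dev/shm", needed=NCCL_SEGMENT_BYTES):
    """Return (total_mib, free_mib, enough) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total / mib, usage.free / mib, usage.free >= needed


# Only report when /dev/shm exists, so the script is safe on other systems.
if os.path.isdir("/dev/shm"):
    total, free, enough = shm_report()
    print(f"/dev/shm: {total:.0f} MiB total, {free:.0f} MiB free, "
          f"room for one segment: {enough}")
```

In a default Docker container the total should come out close to the 64 MB mentioned above.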

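'frontend/dist' does not exist error

This happens when the front-end pages were never created on Linux. The port appears to be up, but the page you get is empty.
Fetching the latest front end from Akegarasu's repository fixes it.
Get the dist directory: GitHub - Akegarasu/lora-gui-dist (https://github.com/hanamizuki-ai/lora-gui-dist)

Place it under the 'frontend' directory: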

frontend/
└── dist
    ├── 404.html
    ├── assets
    ├── dreambooth
    ├── index.html
    ├── lora
    ├── other
    ├── tageditor.html
    ├── tagger.html
    ├── task.html
    └── tensorboard.html
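
A quick way to confirm the directory landed in the right place is to look for a few of the entries from the listing above. This is a sketch; the helper name missing_frontend_files is hypothetical, and the EXPECTED tuple is a sample of the listing rather than a complete manifest.

```python
from pathlib import Path

# A few entries from the listing above; not an exhaustive manifest.
EXPECTED = ("index.html", "assets", "lora", "dreambooth")


def missing_frontend_files(repo_root="."):
    """Return the expected entries that are absent from frontend/dist."""
    dist = Path(repo_root) / "frontend" / "dist"
    return [name for name in EXPECTED if not (dist / name).exists()]


# Run from the lora-scripts root directory.
missing = missing_frontend_files(".")
print("frontend/dist looks complete" if not missing else f"missing: {missing}")
```

An empty result means the checked entries are present; anything listed as missing usually points to the dist folder being nested one level too deep or not copied at all.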

Error message: RuntimeError: Couldn't install xxx

Traceback (most recent call last):
  File "/root/lora-scripts/gui.py", line 93, in <module>
    launch()
  File "/root/lora-scripts/gui.py", line 59, in launch
    prepare_environment(disable_auto_mirror=args.disable_auto_mirror)
  File "/root/lora-scripts/mikazuki/launch_utils.py", line 288, in prepare_environment
    validate_requirements("requirements.txt")
  File "/root/lora-scripts/mikazuki/launch_utils.py", line 197, in validate_requirements
    run_pip(f"install {line}", line, live=True)
  File "/root/lora-scripts/mikazuki/launch_utils.py", line 253, in run_pip
    return run(f'"{python_bin}" -m pip {command}', desc=f"Installing {desc}", errdesc=f"Couldn't install {desc}", live=live)
  File "/root/lora-scripts/mikazuki/launch_utils.py", line 95, in run
    raise RuntimeError(f"""{errdesc or 'Error running command'}.
RuntimeError: Couldn't install numpy<=2.0. 
Command: "/root/lora-scripts/venv/bin/python" -m pip install numpy<=2.0
Error code: 2

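This error is raised when a required dependency is missing and cannot be installed. The message here, RuntimeError: Couldn't install numpy<=2.0., points at numpy and its version constraint:

Before launching, first try installing it by hand: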

pip install "numpy<=2.0"

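If the install reports success but the trainer still fails to start, the likely cause is how the script invokes pip. The requirement numpy<=2.0 reaches the shell unquoted, so the shell reads < as an input redirection from a file named =2.0 instead of as part of a version constraint. The script calls pip through subprocess.run() or something similar without quoting the constraint.

Try changing the script logic:
Open mikazuki/launch_utils.py and find the run_pip() function.

Find this line: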

return run(f'"{python_bin}" -m pip {command}', desc=f"Installing {desc}", errdesc=f"Couldn't install {desc}", live=live)

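Change it to the following, which wraps the requirement in double quotes. This works here because validate_requirements passes the requirement line itself as desc, as the traceback shows: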

return run(f'"{python_bin}" -m pip install "{desc}"', desc=f"Installing {desc}", errdesc=f"Couldn't install {desc}", live=live)
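
The quoting problem can be reproduced without touching pip. The sketch below uses echo as a harmless stand-in for the pip command and runs it through the shell twice, once unquoted and once quoted with shlex.quote. The helper name run_in_shell is made up for this example, and the empty temporary directory just guarantees that no file named =2.0 exists.

```python
import shlex
import subprocess
import tempfile


def run_in_shell(command, cwd):
    """Run command through the system shell, as a shell=True call would."""
    return subprocess.run(command, shell=True, cwd=cwd,
                          capture_output=True, text=True)


requirement = "numpy<=2.0"
with tempfile.TemporaryDirectory() as workdir:
    # Unquoted: the shell tries to read stdin from a file named "=2.0".
    unquoted = run_in_shell(f"echo {requirement}", workdir)
    # Quoted: shlex.quote wraps the text so the shell passes it through intact.
    quoted = run_in_shell(f"echo {shlex.quote(requirement)}", workdir)

print("unquoted exit code:", unquoted.returncode)
print("quoted output:", quoted.stdout.strip())
```

The unquoted call fails because the redirect target is missing, while the quoted one receives the full requirement. Passing the arguments as a list without shell=True would sidestep the shell entirely, but that is a larger change than the one-line edit shown above.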

Training failed: HTTPSConnectionPool(host='huggingface.co'...

requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): 
Max retries exceeded with url: /openai/clip-vit-large-patch14/resolve/main/tokenizer_config.json 
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate 
verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))"), 
'(Request ID: bc43902a-c5ad-4e5b-baea-a5848d6fbcfb)')
21:22:35-282538 ERROR    Training failed / 训练失败

This most likely means your machine cannot reach huggingface.co. You can use a VPN or switch to a mirror in mainland China:

export HF_ENDPOINT=https://hf-mirror.com

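For temporary use, run it in the shell right before launching.
For permanent use, append it to the end of ~/.bashrc with vi or nano,
then reload the environment variables with: source ~/.bashrc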
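
To illustrate what the variable changes, the sketch below builds a download URL in the same shape as the one in the error log, first honouring HF_ENDPOINT. The helper resolve_url is hypothetical and written only for this example; it is not the huggingface_hub API, it just shows that the host part is what gets swapped.

```python
import os

DEFAULT_ENDPOINT = "https://huggingface.co"


def resolve_url(repo_id, filename, revision="main"):
    """Build a download URL, honouring HF_ENDPOINT when it is set."""
    endpoint = os.environ.get("HF_ENDPOINT", DEFAULT_ENDPOINT).rstrip("/")
    return f"{endpoint}/{repo_id}/resolve/{revision}/{filename}"


# Same effect as `export HF_ENDPOINT=...`, but only for this process.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
print(resolve_url("openai/clip-vit-large-patch14", "tokenizer_config.json"))
# prints: https://hf-mirror.com/openai/clip-vit-large-patch14/resolve/main/tokenizer_config.json
```

The variable has to be set in the environment of the process that starts gui.py, which is why putting the export into the launch script or ~/.bashrc works.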

Last modified: February 20, 2025


