HTML to Markdown & Markdown to HTML Conversion Guide
Master converting between HTML and Markdown formats with this comprehensive guide. Learn conversion techniques, tools, best practices, and handle edge cases for seamless document transformation.
Understanding Format Conversion
Converting between HTML and Markdown is a common need in web development, content management, and documentation. Each format has its strengths: HTML is the web standard for rendering, while Markdown is lightweight and human-readable.
This guide covers both directions of conversion, including edge cases, tool selection, and best practices for maintaining content quality during transformation.
When to Convert
- Markdown to HTML: Publishing documentation, blogging platforms, static site generators
- HTML to Markdown: Content preservation, documentation standardization, version control
- Both Directions: Content migration, format flexibility, multi-platform support
Format Comparison
Markdown: Lightweight, readable source code, easy to version control
HTML: Full semantic markup, browser native, precise styling support
Converting Markdown to HTML
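Markdown to HTML conversion is well supported because Markdown was designed to compile to HTML. Most platforms rely on an existing parser such as marked, markdown-it, pandoc, or a CommonMark implementation rather than writing their own.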
Basic Markdown Syntax
JavaScript Conversion
Python Conversion
Handling Advanced Markdown
Common Pitfalls
- Not escaping HTML special characters in Markdown text
- Inconsistent line break handling
- Not properly handling nested lists
- Missing language specification in code blocks
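The first and last pitfalls can be illustrated with Python's standard library. This is a minimal sketch rather than how any particular parser works: `render_paragraph` and `render_code_block` are hypothetical helpers that escape `<`, `>`, and `&` before wrapping text in tags, and that carry the code block's language through as a `language-*` class.

```python
import html

def render_paragraph(text: str) -> str:
    """Wrap plain text in <p>, escaping HTML special characters first."""
    return f"<p>{html.escape(text, quote=False)}</p>"

def render_code_block(code: str, language: str = "") -> str:
    """Render a code block; the language, if given, becomes a class for highlighters."""
    cls = f' class="language-{html.escape(language)}"' if language else ""
    return f"<pre><code{cls}>{html.escape(code, quote=False)}</code></pre>"

print(render_paragraph("Use a < b && c > d"))
print(render_code_block("<div>hi</div>", "html"))
```

Without the escaping step, the `<div>` in the second call would be treated as markup by the browser instead of being displayed as code.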
Converting HTML to Markdown
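HTML to Markdown conversion is harder because HTML can express structure and styling that Markdown cannot, so the conversion is lossy. Approaches range from regex substitution, which only suits small and predictable inputs, to DOM-based parsing, which handles nesting, attributes, and malformed markup more reliably.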
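To see why regex-based conversion is fragile, consider this small sketch. The function name `strong_to_markdown` is made up for illustration; the pattern is typical of hand-written rules that match one exact tag spelling.

```python
import re

def strong_to_markdown(html_text: str) -> str:
    """Naive regex rule: only matches <strong> written with no attributes."""
    return re.sub(r"<strong>(.*?)</strong>", r"**\1**", html_text)

# The plain tag converts as intended
print(strong_to_markdown("<strong>ok</strong>"))
# Adding an attribute defeats the pattern, so the HTML passes through unchanged
print(strong_to_markdown('<strong class="x">missed</strong>'))
```

A parser-based approach sees both inputs as the same element, which is the main reason it scales better to real pages.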
Challenges in HTML to Markdown
- Semantic HTML elements (header, footer, nav) have no direct Markdown equivalent
- Complex styling and nested structures
- Handling of different heading levels and emphasis
- Table and list conversions
- Inline HTML preservation
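One common way to deal with the first challenge is to unwrap elements that have no Markdown equivalent: keep their text and drop the wrapper. The sketch below uses Python's built-in `html.parser`; `MiniConverter` and `to_markdown` are illustrative names, and the converter deliberately handles only headings, paragraphs, bold, and italic.

```python
import re
from html.parser import HTMLParser

class MiniConverter(HTMLParser):
    """Tiny HTML-to-Markdown sketch. Tags not listed below (nav, header,
    footer, div, ...) are unwrapped: their text is kept, the tag is dropped."""

    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append(self.HEADINGS[tag])
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag in self.HEADINGS or tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Simplification: whitespace-only runs between tags are dropped,
        # other whitespace runs are collapsed to a single space.
        if data.strip():
            self.parts.append(re.sub(r"\s+", " ", data))

def to_markdown(html_text: str) -> str:
    parser = MiniConverter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()

print(to_markdown("<header><h1>Title</h1></header>"
                  "<p>Some <strong>bold</strong> and <em>italic</em> text.</p>"))
```

Lists, tables, links, and code blocks are left out on purpose; libraries such as TurndownJS or markdownify cover those cases.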
JavaScript HTML to Markdown
Using TurndownJS
Python HTML to Markdown
Advanced Conversions
Tools and Libraries Comparison
Choosing the right tool depends on your specific needs, performance requirements, and edge cases.
Markdown to HTML Tools
HTML to Markdown Tools
Selection Criteria
- marked: Fast, simple, good for basic Markdown
- remark: Powerful AST-based, good for plugins
- TurndownJS: Best HTML to Markdown, handles complexity
- pandoc: Universal converter, excellent quality
- markdown-it: Extensible, great plugin ecosystem
Handling Edge Cases
Real-world conversions often involve edge cases that require special handling.
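Two of the most frequent edge cases, HTML entities and stray whitespace, can be handled with the standard library alone. The helper name `normalize_text` is an assumption for this sketch; it decodes entities first and then collapses whitespace, roughly the way a browser does when rendering inline text.

```python
import html
import re

def normalize_text(fragment: str) -> str:
    """Decode HTML entities, then collapse whitespace runs to single spaces."""
    decoded = html.unescape(fragment)
    return re.sub(r"\s+", " ", decoded).strip()

print(normalize_text("Copyright &copy; 2024 &amp; Friends"))
print(normalize_text("  Multiple   spaces\n  and line breaks  "))
```

This should not be applied to the contents of `<pre>` or `<code>` elements, where whitespace is significant.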
Common Edge Cases
Preserving Content Quality
Testing Conversions
Best Practices for Conversion
Follow these practices to ensure high-quality, maintainable conversions.
Before Conversion
- Validate source format (HTML or Markdown)
- Identify target use case and requirements
- Plan for edge cases specific to your content
- Set up testing for quality assurance
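Validating the source can be as simple as checking that tags are balanced before conversion starts. The following is a rough sketch with hypothetical names (`TagBalanceChecker`, `find_tag_problems`). It is stricter than the HTML standard, which allows some end tags such as `</li>` and `</p>` to be omitted, so treat its findings as hints rather than errors.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class TagBalanceChecker(HTMLParser):
    """Track open tags on a stack and record mismatched closing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")

def find_tag_problems(html_text):
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    # Anything still on the stack was opened but never closed
    return checker.problems + [f"unclosed <{t}>" for t in checker.stack]

print(find_tag_problems("<p>ok <strong>fine</strong><br></p>"))
print(find_tag_problems("<p>text"))
```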
During Conversion
- Use appropriate tools for your format pair
- Configure options to match your needs
- Handle exceptions and edge cases
- Log conversion warnings or errors
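The last two points can be combined in a thin wrapper around whichever converter is in use. This is only a sketch: `safe_convert` is a hypothetical helper, and `convert` stands for any function that takes a source string and returns the converted string.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("converter")

def safe_convert(source: str, convert: Callable[[str], str],
                 name: str = "<input>") -> Optional[str]:
    """Run a converter, logging failures and suspicious output instead of crashing."""
    try:
        result = convert(source)
    except Exception:
        # logger.exception records the traceback along with the message
        logger.exception("Conversion failed for %s", name)
        return None
    if source.strip() and not result.strip():
        logger.warning("Empty output for non-empty input: %s", name)
    return result

print(safe_convert("<p>hi</p>", lambda s: s.replace("<p>", "").replace("</p>", "")))
```

Returning `None` on failure lets a batch job continue with the next file while the log keeps the details.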
After Conversion
- Validate output format
- Verify content integrity
- Check for formatting issues
- Maintain version control
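Content integrity can be spot-checked by comparing structural counts before and after conversion. The sketch below compares heading and link counts between an HTML source and its Markdown output; `integrity_warnings` is an illustrative name, and the Markdown patterns are simple heuristics that would also count `#` lines inside code fences.

```python
import re
from html.parser import HTMLParser
from typing import List

class _Counter(HTMLParser):
    """Count headings and links in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = 0
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.headings += 1
        elif tag == "a" and dict(attrs).get("href"):
            self.links += 1

def integrity_warnings(html_text: str, markdown_text: str) -> List[str]:
    counter = _Counter()
    counter.feed(html_text)
    counter.close()
    # ATX headings only; the lookbehind keeps image syntax out of the link count
    md_headings = len(re.findall(r"^#{1,6} ", markdown_text, flags=re.MULTILINE))
    md_links = len(re.findall(r"(?<!!)\[[^\]]*\]\([^)]*\)", markdown_text))
    warnings = []
    if counter.headings != md_headings:
        warnings.append(f"heading count: {counter.headings} in HTML, {md_headings} in Markdown")
    if counter.links != md_links:
        warnings.append(f"link count: {counter.links} in HTML, {md_links} in Markdown")
    return warnings

print(integrity_warnings("<h1>T</h1><h2>S</h2>", "# T"))
```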
Batch Processing
Practical Conversion Examples
Real-world examples of common conversion scenarios.
Blog Post Migration
Documentation Generation
Static Site Generation
Content Preservation
Conclusion
Format conversion between HTML and Markdown is essential in modern web development. By understanding the strengths and limitations of each approach, you can choose the right tools and techniques for your specific use case.
Key takeaways:
- Markdown to HTML is straightforward with tools like marked or markdown-it
- HTML to Markdown requires more care due to semantic complexity
- Choose tools based on your specific needs and performance requirements
- Test edge cases relevant to your content
- Preserve content quality through careful conversion planning
- Validate both source and output formats
# Headings
# H1 Heading
## H2 Heading
### H3 Heading
# Emphasis
**Bold text**
*Italic text*
***Bold and italic***
~~Strikethrough~~
# Lists
- Unordered item 1
- Unordered item 2
  - Nested item
1. Ordered item 1
2. Ordered item 2
# Code
`inline code`
```javascript
// Code block with language
const message = "Hello World";
```
# Links and Images
[Link text](https://example.com)
![Alt text](https://example.com/image.jpg)
# Blockquotes
> This is a quote
> with multiple lines
# Tables
| Header 1 | Header 2 |
|----------|----------|
| Cell 1 | Cell 2 |

// Using marked.js (most popular)
// npm install marked
import { marked } from 'marked';
const markdown = `
# Hello World
This is a **bold** and *italic* example.
\`\`\`javascript
const x = 42;
\`\`\`
`;
// Convert to HTML
const html = marked(markdown);
console.log(html);
// With options
const html2 = marked(markdown, {
breaks: true, // Convert line breaks to <br>
gfm: true, // GitHub Flavored Markdown
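headerIds: true, // Add IDs to headings (marked < 8 only; later versions use the marked-gfm-heading-id extension)
mangle: false // Don't obfuscate autolinked email addresses (marked < 8 only)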
});
// Using markdown-it (more control)
// npm install markdown-it
import MarkdownIt from 'markdown-it';
const md = new MarkdownIt('commonmark');
const result = md.render(markdown);
// With syntax highlighting (set through the highlight option)
import hljs from 'highlight.js';
md.set({ highlight: (code, lang) => {
if (lang && hljs.getLanguage(lang)) {
return hljs.highlight(code, { language: lang }).value;
}
return '';
}});
// Using remark (AST-based)
// npm install remark remark-html
import { remark } from 'remark';
import remarkHtml from 'remark-html';
const file = await remark()
.use(remarkHtml)
.process(markdown);
console.log(String(file));

# Using markdown module (simple)
# pip install markdown
import markdown
md_text = """
# Hello World
This is **bold** and *italic*.
```python
x = 42
print(x)
```
"""
html = markdown.markdown(md_text)
print(html)
# Using python-markdown with extensions
# Built-in extensions are referenced by name, so no extra import is needed
# (codehilite additionally requires Pygments: pip install pygments)
extensions = [
'codehilite',
'fenced_code',
'tables',
'toc'
]
html = markdown.markdown(md_text, extensions=extensions)
# Using pandoc (most powerful)
# pip install pandoc  (Python wrapper; the pandoc executable must also be installed)
import pandoc
# Convert markdown to HTML
doc = pandoc.read(md_text, format='markdown')
html = pandoc.write(doc, format='html')
# With options
html = pandoc.write(doc, format='html',
options=['--standalone', '--css', 'style.css'])
# Using mistune (fast, for performance)
# pip install mistune
import mistune
markdown_parser = mistune.create_markdown()
html = markdown_parser(md_text)
# Custom renderer
class CustomRenderer(mistune.HTMLRenderer):
    def heading(self, text, level, **kwargs):
        return f'<h{level} class="custom">{text}</h{level}>\n'

renderer = CustomRenderer()
markdown_parser = mistune.create_markdown(renderer=renderer)
html = markdown_parser(md_text)

// Extended Markdown features
const markdown = `
# Task Lists (GitHub Markdown)
- [x] Completed task
- [ ] Incomplete task
- [x] Another done task
# Emoji Support
:smile: :heart: :thumbsup:
# Footnotes
This has a footnote[^1].
[^1]: The footnote content here.
# Superscript and Subscript
H~2~O (water)
E=mc^2 (Einstein)
# Abbreviations
HTML is great
*[HTML]: HyperText Markup Language
# Definition List
Term
: Definition of term
# Strikethrough
~~Deleted text~~ becomes ~~like this~~
# Front Matter
---
title: Example
date: 2024-01-01
---
# Content here
`;
// Processing with remark + plugins
import { remark } from 'remark';
import remarkHtml from 'remark-html';
import remarkGfm from 'remark-gfm';
import remarkMath from 'remark-math';
import rehypeKatex from 'rehype-katex';
const file = await remark()
.use(remarkGfm) // GitHub Flavored Markdown
.use(remarkMath) // Math support
.use(remarkHtml)
.process(markdown);
console.log(String(file));
// Or with markdown-it
import MarkdownIt from 'markdown-it';
import markdownItCheckbox from 'markdown-it-checkbox';
import markdownItEmoji from 'markdown-it-emoji';
import markdownItTable from 'markdown-it-table';
const md = new MarkdownIt()
.use(markdownItCheckbox)
.use(markdownItEmoji)
.use(markdownItTable);
const html = md.render(markdown);

// Using TurndownJS (recommended for HTML to Markdown)
// npm install turndown
import TurndownService from 'turndown';
const html = `
<h1>Hello World</h1>
<p>This is a <strong>bold</strong> and <em>italic</em> paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
`;
const turndownService = new TurndownService();
const markdown = turndownService.turndown(html);
console.log(markdown);
// Basic HTML to Markdown using regex (simple cases only)
function simpleHtmlToMarkdown(html) {
return html
// Headers
.replace(/<h1>(.*?)<\/h1>/g, '# $1\n')
.replace(/<h2>(.*?)<\/h2>/g, '## $1\n')
.replace(/<h3>(.*?)<\/h3>/g, '### $1\n')
// Bold
.replace(/<strong>(.*?)<\/strong>/g, '**$1**')
.replace(/<b>(.*?)<\/b>/g, '**$1**')
// Italic
.replace(/<em>(.*?)<\/em>/g, '*$1*')
.replace(/<i>(.*?)<\/i>/g, '*$1*')
// Links
.replace(/<a href="(.*?)">(.*?)<\/a>/g, '[$2]($1)')
// Line breaks
.replace(/<br\s*\/?>/g, '\n')
.replace(/<p>(.*?)<\/p>/g, '$1\n\n')
// Remove remaining tags
.replace(/<[^>]*>/g, '');
}
// Using html-to-text (alternative)
// npm install html-to-text
import { convert } from 'html-to-text';
const plainText = convert(html, {
wordwrap: 80,
preserveNewlines: true
});

import TurndownService from 'turndown';
const turndownService = new TurndownService({
headingStyle: 'atx', // # instead of underline
hr: '---', // Horizontal rule style
bulletListMarker: '-', // Bullet style
codeBlockStyle: 'fenced', // Use backticks
emDelimiter: '*', // Use * for emphasis
strongDelimiter: '**' // Use ** for strong
});
// Add custom rules
turndownService.addRule('strikethrough', {
filter: ['s', 'del', 'strike'],
replacement: (content) => `~~${content}~~`
});
turndownService.addRule('highlight', {
filter: 'mark',
replacement: (content) => `**${content}**`
});
// Keep certain HTML
turndownService.keep(['figure', 'figcaption']);
const html = `
<article>
<h1>Article Title</h1>
<p>Introduction with <strong>bold</strong> text.</p>
<h2>Section</h2>
<p>Content with <del>deleted</del> and <mark>highlighted</mark> text.</p>
<figure>
<img src="image.jpg" alt="Description" />
<figcaption>Figure caption</figcaption>
</figure>
<ul>
<li>List item 1</li>
<li>List item 2</li>
</ul>
<blockquote>
<p>A quote</p>
</blockquote>
<pre><code class="language-javascript">
const x = 42;
</code></pre>
</article>
`;
const markdown = turndownService.turndown(html);
console.log(markdown);

# Using markdownify (Python equivalent of TurndownJS)
# pip install markdownify
from markdownify import markdownify as md
html = """
<h1>Hello World</h1>
<p>This is a <strong>bold</strong> and <em>italic</em> paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
"""
markdown = md(html)
print(markdown)
# With options
markdown = md(html,
              heading_style='atx',   # Use # for headings
              strong_em_symbol='*',  # Use * and ** for emphasis
              escape_misc=True)      # Escape other Markdown-significant punctuation
# Using html2text
# pip install html2text
import html2text
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = False
h.body_width = 0
markdown = h.handle(html)
print(markdown)
# Using pandoc (most comprehensive)
# pip install pandoc
import pandoc
# Convert HTML to Markdown
doc = pandoc.read(html, format='html')
markdown = pandoc.write(doc, format='markdown')
# Using BeautifulSoup for manual conversion
# pip install beautifulsoup4
from bs4 import BeautifulSoup
def html_to_markdown_custom(html):
    soup = BeautifulSoup(html, 'html.parser')
    markdown = []
    for element in soup.descendants:
        if element.name == 'h1':
            markdown.append(f"# {element.get_text()}\n")
        elif element.name == 'h2':
            markdown.append(f"## {element.get_text()}\n")
        elif element.name == 'p':
            markdown.append(f"{element.get_text()}\n\n")
        elif element.name in ('strong', 'b'):
            # Inline emphasis is flattened to plain text by get_text() on the parent
            pass
    return ''.join(markdown)

// Advanced HTML to Markdown with DOM parsing
class HtmlToMarkdown {
constructor(options = {}) {
this.options = {
baseUrl: '',
imageFolder: 'images',
preserveInline: ['span', 'a'],
tableStyle: 'markdown', // or 'html'
...options
};
}
convert(html) {
const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');
return this.processNode(doc.body);
}
processNode(node) {
if (node.nodeType === Node.TEXT_NODE) {
return node.textContent.trim();
}
if (node.nodeType !== Node.ELEMENT_NODE) {
return '';
}
const tagName = node.tagName.toLowerCase();
const children = Array.from(node.childNodes)
.map(n => this.processNode(n))
.filter(s => s);
switch (tagName) {
case 'h1':
return `# ${children.join('')}\n\n`;
case 'h2':
return `## ${children.join('')}\n\n`;
case 'h3':
return `### ${children.join('')}\n\n`;
case 'p':
return `${children.join('')}\n\n`;
case 'strong':
case 'b':
return `**${children.join('')}**`;
case 'em':
case 'i':
return `*${children.join('')}*`;
case 'del':
case 's':
return `~~${children.join('')}~~`;
case 'a': {
const href = node.getAttribute('href');
return `[${children.join('')}](${href})`;
}
case 'img': {
const src = node.getAttribute('src');
const alt = node.getAttribute('alt') || 'image';
return `![${alt}](${src})`;
}
case 'ul':
return `${children.map(c => '- ' + c).join('\n')}\n\n`;
case 'ol':
return children.map((c, i) => `${i + 1}. ${c}`).join('\n') + '\n\n';
case 'li':
return children.join('');
case 'code':
return `\`${children.join('')}\``;
case 'pre':
return `\`\`\`\n${children.join('')}\n\`\`\`\n\n`;
case 'blockquote':
return children.map(line => '> ' + line).join('\n') + '\n\n';
case 'table':
return this.convertTable(node);
default:
return children.join('');
}
}
convertTable(tableNode) {
const rows = Array.from(tableNode.querySelectorAll('tr'));
if (rows.length === 0) return '';
const lines = [];
const headerRow = rows[0];
const headers = Array.from(headerRow.querySelectorAll('th, td'))
.map(cell => cell.textContent.trim());
lines.push('| ' + headers.join(' | ') + ' |');
lines.push('|' + headers.map(() => '---|').join(''));
for (let i = 1; i < rows.length; i++) {
const cells = Array.from(rows[i].querySelectorAll('td'))
.map(cell => cell.textContent.trim());
lines.push('| ' + cells.join(' | ') + ' |');
}
return lines.join('\n') + '\n\n';
}
}
// Usage
const converter = new HtmlToMarkdown();
const markdown = converter.convert(htmlString);

| Tool | Language | Speed | Features | Use Case |
|------|----------|-------|----------|----------|
| marked | JavaScript | Fast | Basic | Quick conversion, simple content |
| markdown-it | JavaScript | Good | Plugins | Extensible, plugins ecosystem |
| remark | JavaScript | Good | AST | Complex transformations, plugins |
| unified | JavaScript | Good | Plugins | Powerful plugin system |
| pandoc | CLI/Haskell | Good | Excellent | Universal conversion, quality |
| mistune | Python | Fast | Basic | Python projects, performance |
| python-markdown | Python | Good | Extensions | Python markdown processing |
| mkdocs | Python | Good | Site gen | Static documentation sites |
| hugo | Go | Fast | Full SSG | Static site generation |

| Tool | Language | Quality | Features | Best For |
|------|----------|---------|----------|----------|
| TurndownJS | JavaScript | Excellent | Customizable | Web conversion, complex HTML |
| html-to-text | JavaScript | Good | Simple | Plain text extraction |
| html2text | Python | Good | Simple | Python projects |
| markdownify | Python | Excellent | Configurable | Python markdown conversion |
| pandoc | CLI/Haskell | Excellent | Universal | Best quality conversion |
| BeautifulSoup+custom | Python | Custom | Control | Complex, custom logic |
| dom-parser+regex | Any | Basic | Simple | Quick, simple cases |
| jsdom+custom | JavaScript | Custom | Control | Node.js complex parsing |
| wkhtmltomarkdown | External | Fair | Limited | HTML to plain text |

// Common edge cases and solutions
const edgeCases = {
// 1. Nested lists with mixed types
nestedLists: `
<ul>
<li>Item 1
<ol>
<li>Sub-item A</li>
<li>Sub-item B</li>
</ol>
</li>
<li>Item 2</li>
</ul>
`,
// 2. Tables with merged cells
complexTables: `
<table>
<tr>
<th colspan="2">Header</th>
</tr>
<tr>
<td>Cell 1</td>
<td>Cell 2</td>
</tr>
</table>
`,
// 3. HTML special characters
specialChars: `
<p>Copyright &copy; 2024 &amp; Friends</p>
<p>Price: 99 &euro;</p>
`,
// 4. Whitespace handling
whitespace: `
<p>
Multiple spaces
and line breaks
should be handled
</p>
`,
// 5. Mixed content
mixedContent: `
<p>
Text with <strong>bold</strong> and
<a href="/">link</a> and
<code>code</code> together
</p>
`,
// 6. Images with srcset
responsiveImages: `
<img src="small.jpg"
srcset="medium.jpg 500w, large.jpg 1000w"
alt="Responsive image" />
`,
// 7. Code blocks with HTML
codeWithHtml: `
<pre><code>&lt;div class="example"&gt;
&lt;p&gt;HTML content&lt;/p&gt;
&lt;/div&gt;
</code></pre>
`,
// 8. Deep nesting
deepNesting: `
<div><div><div><p>
Deeply nested content
</p></div></div></div>
`,
// 9. Empty elements
emptyElements: `
<p></p>
<div></div>
<ul><li></li></ul>
`,
// 10. Comments
withComments: `
<p>Visible</p>
<!-- This comment should be removed -->
<p>More visible</p>
`
};
// Solutions
const solutions = {
nestedLists: 'Use recursive indent handling',
complexTables: "Markdown tables don't support colspan; keep the table as inline HTML or flatten it",
specialChars: 'Decode HTML entities using entities library',
whitespace: 'Collapse multiple spaces, preserve intentional breaks',
mixedContent: 'Process recursively, handle inline elements',
responsiveImages: 'Markdown images have no srcset; keep the primary src or fall back to inline HTML',
codeWithHtml: 'Decode entities and emit the code verbatim inside a fenced block',
deepNesting: 'Flatten unnecessary divs',
emptyElements: 'Remove or add placeholder content',
withComments: 'Strip comments during parsing'
};

// Techniques to preserve content quality during conversion
class QualityPreservingConverter {
constructor() {
this.warnings = [];
this.mappings = new Map();
}
// Track content mapping for validation
mapContent(original, converted) {
const key = this.hashContent(original);
this.mappings.set(key, { original, converted });
}
// Validate conversion completeness
// original: DOM element (for example document.body); converted: Markdown string
validateConversion(original, converted) {
const origLength = original.textContent.length;
const convLength = converted.length;
if (convLength < origLength * 0.9) {
this.warnings.push('Content length reduced significantly');
}
const origWords = origLength / 5;
const convWords = convLength / 5;
if (Math.abs(origWords - convWords) / origWords > 0.1) {
this.warnings.push('Word count differs by >10%');
}
}
// Verify structure preservation
verifyStructure(original, converted) {
const origHeadings = original.querySelectorAll('h1,h2,h3,h4,h5,h6').length;
const convHeadings = converted.split('\n').filter(line =>
line.startsWith('#')).length;
if (origHeadings !== convHeadings) {
this.warnings.push(`Heading count mismatch: ${origHeadings} vs ${convHeadings}`);
}
const origLinks = original.querySelectorAll('a').length;
const convLinks = (converted.match(/\]\(/g) || []).length;
if (origLinks !== convLinks) {
this.warnings.push(`Link count mismatch: ${origLinks} vs ${convLinks}`);
}
}
// Extract semantic information
extractMetadata(html) {
return {
hasImages: html.includes('<img'),
hasCodeBlocks: html.includes('<code'),
hasLists: html.includes('<ul') || html.includes('<ol'),
hasTables: html.includes('<table'),
headingLevels: this.extractHeadingLevels(html)
};
}
// Round-trip testing
roundTripTest(original) {
const markdown = this.htmlToMarkdown(original);
const back = this.markdownToHtml(markdown);
const origTokens = this.tokenize(original);
const backTokens = this.tokenize(back);
const similarity = this.calculateSimilarity(origTokens, backTokens);
return {
markdown,
similarity,
isAcceptable: similarity > 0.95
};
}
hashContent(str) {
let hash = 0;
for (let i = 0; i < str.length; i++) {
const char = str.charCodeAt(i);
hash = ((hash << 5) - hash) + char;
hash = hash & hash;
}
return hash.toString();
}
tokenize(text) {
return text.match(/\b\w+\b/g) || [];
}
calculateSimilarity(arr1, arr2) {
const set1 = new Set(arr1);
const set2 = new Set(arr2);
const intersection = new Set([...set1].filter(x => set2.has(x)));
const union = new Set([...set1, ...set2]);
return intersection.size / union.size;
}
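}

// Unit tests for conversion quality
import { describe, test, expect } from '@jest/globals';
import TurndownService from 'turndown';
import { marked } from 'marked';
describe('HTML to Markdown Conversion', () => {
// Turndown defaults to '*' bullets and '_' emphasis; set the styles the assertions expect
const turndown = new TurndownService({ bulletListMarker: '-', emDelimiter: '*' });
// Explicitly drop script and style elements so their content cannot leak into the output
turndown.remove(['script', 'style']);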
test('converts simple paragraph', () => {
const html = '<p>Hello world</p>';
const md = turndown.turndown(html);
expect(md).toContain('Hello world');
expect(md).not.toContain('<p>');
});
test('preserves bold text', () => {
const html = '<p><strong>Bold</strong> text</p>';
const md = turndown.turndown(html);
expect(md).toContain('**Bold**');
});
test('converts headings correctly', () => {
const html = '<h1>Title</h1><h2>Subtitle</h2>';
const md = turndown.turndown(html);
expect(md).toContain('# Title');
expect(md).toContain('## Subtitle');
});
test('handles lists properly', () => {
const html = '<ul><li>Item 1</li><li>Item 2</li></ul>';
const md = turndown.turndown(html);
expect(md).toMatch(/[-*]\s+Item [12]/); // Turndown pads the list marker with extra spaces
});
test('preserves links', () => {
const html = '<a href="https://example.com">Link</a>';
const md = turndown.turndown(html);
expect(md).toContain('[Link](https://example.com)');
});
test('handles nested elements', () => {
const html = '<p><strong><em>Bold and italic</em></strong></p>';
const md = turndown.turndown(html);
expect(md).toMatch(/\*\*[*_]Bold and italic[*_]\*\*/); // emphasis delimiter depends on emDelimiter
});
test('removes script tags', () => {
const html = '<p>Safe</p><script>alert("xss")</script>';
const md = turndown.turndown(html);
expect(md).not.toContain('script');
expect(md).toContain('Safe');
});
});
describe('Markdown to HTML Conversion', () => {
test('converts headings', () => {
const md = '# Title\n## Subtitle';
const html = marked(md);
expect(html).toMatch(/<h1[^>]*>Title<\/h1>/); // older marked versions add id attributes
expect(html).toMatch(/<h2[^>]*>Subtitle<\/h2>/);
});
test('converts emphasis', () => {
const md = '**Bold** and *italic*';
const html = marked(md);
expect(html).toContain('<strong>Bold</strong>');
expect(html).toContain('<em>italic</em>');
});
test('converts links', () => {
const md = '[Link](https://example.com)';
const html = marked(md);
expect(html).toContain('href="https://example.com"');
expect(html).toContain('>Link<');
});
test('handles code blocks', () => {
const md = '```javascript\nconst x = 42;\n```';
const html = marked(md);
expect(html).toContain('<code');
expect(html).toContain('const x = 42');
});
});
describe('Round-trip Conversion', () => {
const turndown = new TurndownService();
test('markdown -> html -> markdown preserves content', () => {
const original = '# Title\nParagraph with **bold** text.';
const html = marked(original);
const back = turndown.turndown(html);
expect(back).toContain('Title');
expect(back).toContain('**bold**');
});
});

// Batch processing for multiple files
const fs = require('fs').promises;
const path = require('path');
const TurndownService = require('turndown');
const { marked } = require('marked'); // marked v4+ exports the function by name
class BatchConverter {
constructor(options = {}) {
this.options = options;
this.turndown = new TurndownService();
this.results = [];
}
async convertDirectory(inputDir, outputDir, format = 'md') {
try {
await fs.mkdir(outputDir, { recursive: true });
const files = await fs.readdir(inputDir);
for (const file of files) {
if (format === 'md' && file.endsWith('.html')) {
await this.convertFile(
path.join(inputDir, file),
path.join(outputDir, file.replace('.html', '.md')),
'html-to-md'
);
} else if (format === 'html' && file.endsWith('.md')) {
await this.convertFile(
path.join(inputDir, file),
path.join(outputDir, file.replace('.md', '.html')),
'md-to-html'
);
}
}
console.log(`Batch conversion complete: ${this.results.length} files`);
return this.results;
} catch (error) {
console.error('Batch conversion failed:', error);
throw error;
}
}
async convertFile(inputPath, outputPath, type) {
try {
const content = await fs.readFile(inputPath, 'utf-8');
let converted;
if (type === 'html-to-md') {
converted = this.turndown.turndown(content);
} else if (type === 'md-to-html') {
converted = marked(content);
}
await fs.writeFile(outputPath, converted, 'utf-8');
this.results.push({
input: inputPath,
output: outputPath,
status: 'success',
size: content.length
});
} catch (error) {
this.results.push({
input: inputPath,
status: 'error',
error: error.message
});
}
}
async convertWithProgress(files, type) {
const total = files.length;
for (let i = 0; i < total; i++) {
const file = files[i];
process.stdout.write(`\rProgress: ${i + 1}/${total}`);
// Convert file...
}
console.log('\nDone!');
}
}
// Usage
const converter = new BatchConverter();
// Convert all HTML files to Markdown
converter.convertDirectory('./html-files', './markdown-files', 'md')
.then(results => {
console.log('Results:', results);
})
.catch(error => console.error('Error:', error));

// Real-world: Migrating blog posts from HTML to Markdown
const fs = require('fs').promises;
const path = require('path');
const TurndownService = require('turndown');
class BlogMigrator {
constructor(sourceDir, targetDir) {
this.sourceDir = sourceDir;
this.targetDir = targetDir;
this.turndown = new TurndownService({
headingStyle: 'atx',
codeBlockStyle: 'fenced'
});
// Preserve certain blog-specific elements
this.turndown.keep(['figure', 'figcaption', 'picture']);
}
async migratePost(htmlFile) {
const html = await fs.readFile(htmlFile, 'utf-8');
const post = this.parsePost(html);
// Extract frontmatter
const frontmatter = this.extractFrontmatter(html);
// Convert content
const markdown = this.turndown.turndown(post.content);
// Combine with frontmatter
const output = this.createMarkdownPost(frontmatter, markdown);
return output;
}
// DOMParser is a browser API; under Node.js this sketch assumes jsdom (npm install jsdom)
parseDocument(html) {
const { JSDOM } = require('jsdom');
return new JSDOM(html).window.document;
}
parsePost(html) {
const doc = this.parseDocument(html);
return {
title: doc.querySelector('h1')?.textContent || '',
date: doc.querySelector('[data-date]')?.getAttribute('data-date') || '',
author: doc.querySelector('[data-author]')?.textContent || '',
content: doc.querySelector('article')?.innerHTML ||
doc.querySelector('.post-content')?.innerHTML || ''
};
}
extractFrontmatter(html) {
const doc = this.parseDocument(html);
return {
title: doc.querySelector('title')?.textContent || '',
description: doc.querySelector('meta[name="description"]')?.getAttribute('content') || '',
date: doc.querySelector('[data-date]')?.getAttribute('data-date') || new Date().toISOString(),
author: doc.querySelector('[data-author]')?.textContent || '',
tags: Array.from(doc.querySelectorAll('[data-tag]'))
.map(el => el.getAttribute('data-tag'))
};
}
createMarkdownPost(frontmatter, markdown) {
const yaml = `---
title: "${frontmatter.title}"
date: ${frontmatter.date}
author: ${frontmatter.author}
description: "${frontmatter.description}"
tags: [${frontmatter.tags.map(t => `"${t}"`).join(', ')}]
---
`;
return yaml + markdown;
}
async migrate() {
await fs.mkdir(this.targetDir, { recursive: true }); // ensure the output directory exists
const files = await fs.readdir(this.sourceDir);
const htmlFiles = files.filter(f => f.endsWith('.html'));
for (const file of htmlFiles) {
try {
const input = path.join(this.sourceDir, file);
const output = path.join(this.targetDir, file.replace('.html', '.md'));
const markdown = await this.migratePost(input);
await fs.writeFile(output, markdown, 'utf-8');
console.log(`✓ Migrated: ${file}`);
} catch (error) {
console.error(`✗ Failed: ${file}`, error.message);
}
}
}
}
// Usage
const migrator = new BlogMigrator('./legacy-posts', './markdown-posts');
migrator.migrate();

// Generate documentation from Markdown source
const fs = require('fs').promises;
const path = require('path');
const { marked } = require('marked'); // marked v4+ exports the function by name
const hljs = require('highlight.js');
class DocumentationGenerator {
constructor(sourceDir, outputDir, config = {}) {
this.sourceDir = sourceDir;
this.outputDir = outputDir;
this.config = {
title: 'Documentation',
baseUrl: '/',
...config
};
// Configure marked with syntax highlighting
marked.setOptions({
highlight: (code, lang) => {
if (lang && hljs.getLanguage(lang)) {
return hljs.highlight(code, { language: lang }).value;
}
return hljs.highlightAuto(code).value;
},
breaks: true,
gfm: true
});
}
async generate() {
const files = await this.readMarkdownFiles(this.sourceDir);
const toc = this.generateTableOfContents(files);
for (const file of files) {
await this.generatePage(file, toc);
}
console.log(`Generated ${files.length} documentation pages`);
}
async generatePage(file, toc) {
const markdown = await fs.readFile(file.path, 'utf-8');
const html = marked(markdown);
const template = this.createTemplate(file.title, html, toc);
const outputPath = path.join(
this.outputDir,
file.name.replace('.md', '.html')
);
await fs.writeFile(outputPath, template, 'utf-8');
}
createTemplate(title, content, toc) {
return `<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>${title} - ${this.config.title}</title>
<link rel="stylesheet" href="/styles.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.8.0/styles/default.min.css">
</head>
<body>
<div class="doc-container">
<nav class="toc">
${toc}
</nav>
<main class="doc-content">
<h1>${title}</h1>
${content}
</main>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.8.0/highlight.min.js"></script>
<script>hljs.highlightAll();</script>
</body>
</html>`;
}
generateTableOfContents(files) {
return files
.map(f => `<li><a href="${f.name.replace('.md', '.html')}">${f.title}</a></li>`)
.join('');
}
async readMarkdownFiles(dir) {
const files = await fs.readdir(dir);
const markdownFiles = [];
for (const file of files) {
if (file.endsWith('.md')) {
markdownFiles.push({
name: file,
path: path.join(dir, file),
title: this.extractTitle(file)
});
}
}
return markdownFiles.sort((a, b) => a.name.localeCompare(b.name));
}
extractTitle(filename) {
return filename
.replace('.md', '')
.replace(/-/g, ' ')
.split(' ')
.map(w => w.charAt(0).toUpperCase() + w.slice(1))
.join(' ');
}
}
// Usage
const generator = new DocumentationGenerator(
'./docs-source',
'./docs-output',
{ title: 'API Documentation' }
);
generator.generate();

// Simple static site generator using Markdown
const fs = require('fs').promises;
const path = require('path');
const { marked } = require('marked'); // marked v4+ exports the function by name
const matter = require('gray-matter');
class StaticSiteGenerator {
constructor(sourceDir, outputDir) {
this.sourceDir = sourceDir;
this.outputDir = outputDir;
this.posts = [];
}
async build() {
await this.collectPosts();
await this.generatePages();
await this.generateIndex();
await this.generateRss();
console.log(`✓ Built ${this.posts.length} pages`);
}
async collectPosts() {
const files = await fs.readdir(this.sourceDir);
for (const file of files) {
if (file.endsWith('.md')) {
const content = await fs.readFile(
path.join(this.sourceDir, file),
'utf-8'
);
const { data, content: markdown } = matter(content);
const html = marked(markdown);
this.posts.push({
slug: file.replace('.md', ''),
title: data.title,
date: new Date(data.date),
author: data.author,
description: data.description,
tags: data.tags || [],
html,
content: markdown
});
}
}
// Sort by date descending
this.posts.sort((a, b) => b.date - a.date);
}
async generatePages() {
for (const post of this.posts) {
const html = this.createPostTemplate(post);
const outputPath = path.join(this.outputDir, `${post.slug}.html`);
await fs.mkdir(this.outputDir, { recursive: true });
await fs.writeFile(outputPath, html, 'utf-8');
}
}
async generateIndex() {
const postsHtml = this.posts
.slice(0, 10) // Show last 10 posts
.map(post => `
<article>
<h3><a href="/${post.slug}.html">${post.title}</a></h3>
<p class="meta">${post.date.toDateString()} by ${post.author}</p>
<p>${post.description}</p>
</article>
`)
.join('');
const html = `<!DOCTYPE html>
<html>
<head>
<title>Blog</title>
<link rel="stylesheet" href="/styles.css">
</head>
<body>
<header><h1>My Blog</h1></header>
<main>
${postsHtml}
</main>
</body>
</html>`;
await fs.writeFile(path.join(this.outputDir, 'index.html'), html, 'utf-8');
}
async generateRss() {
const rssItems = this.posts
.map(post => `
<item>
<title>${post.title}</title>
<link>https://example.com/${post.slug}.html</link>
<pubDate>${post.date.toUTCString()}</pubDate>
<description>${post.description}</description>
</item>
`)
.join('');
const rss = `<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>My Blog</title>
<link>https://example.com</link>
<description>Blog posts</description>
${rssItems}
</channel>
</rss>`;
await fs.writeFile(path.join(this.outputDir, 'feed.xml'), rss, 'utf-8');
}
createPostTemplate(post) {
return `<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>${post.title}</title>
<link rel="stylesheet" href="/styles.css">
</head>
<body>
<article>
<h1>${post.title}</h1>
<p class="meta">${post.date.toDateString()} by ${post.author}</p>
<div class="content">
${post.html}
</div>
<footer>
<p>Tags: ${post.tags.join(', ')}</p>
</footer>
</article>
<nav><a href="/">← Back</a></nav>
</body>
</html>`;
}
}
// Usage
const ssg = new StaticSiteGenerator('./posts', './public');
ssg.build();

// Techniques to preserve all content during conversion
import TurndownService from 'turndown';
class ContentPreservingConverter {
constructor() {
this.metadata = {};
this.warnings = [];
}
// Preserve content with fallbacks
convertWithFallback(html) {
try {
// Primary conversion
return this.primaryConversion(html);
} catch (error) {
this.warnings.push('Primary conversion failed, using fallback');
return this.fallbackConversion(html);
}
}
primaryConversion(html) {
// Use best-quality conversion
// (preserveNestedTables and imageStyle are not Turndown options, so documented ones are used here)
const turndown = new TurndownService({
headingStyle: 'atx',
codeBlockStyle: 'fenced'
});
// Add custom rules for preservation
turndown.addRule('preserve-section', {
filter: ['section', 'article'],
replacement: (content) => `
${content}
`
});
return turndown.turndown(html);
}
fallbackConversion(html) {
// Simple but safe fallback
return html
.replace(/<script[^>]*>.*?<\/script>/gs, '')
.replace(/<style[^>]*>.*?<\/style>/gs, '')
.replace(/<!--.*?-->/gs, '')
+ '\n\n<!-- Original HTML preserved below -->\n\n```html\n' + html + '\n```';
}
// Preserve important attributes
preserveAttributes(html) {
const doc = new DOMParser().parseFromString(html, 'text/html');
// Extract important data attributes
const dataAttributes = {};
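// '[data-*]' is not a valid CSS selector, so scan every element's attributes instead
doc.querySelectorAll('*').forEach(el => {
const attrs = {};
for (const attr of el.attributes) {
if (attr.name.startsWith('data-')) {
attrs[attr.name] = attr.value;
}
}
if (Object.keys(attrs).length > 0) {
dataAttributes[el.id || el.className] = attrs;
}
});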
this.metadata.attributes = dataAttributes;
return dataAttributes;
}
// Create metadata sidecar
createMetadata(originalHtml) {
const doc = new DOMParser().parseFromString(originalHtml, 'text/html');
return {
originalLength: originalHtml.length,
imageSources: Array.from(doc.querySelectorAll('img'))
.map(img => ({
src: img.src,
alt: img.alt,
srcset: img.srcset
})),
links: Array.from(doc.querySelectorAll('a'))
.map(a => ({
href: a.href,
text: a.textContent,
title: a.title
})),
metadata: {
lang: doc.documentElement.lang,
title: doc.title,
description: doc.querySelector('meta[name="description"]')?.content,
author: doc.querySelector('meta[name="author"]')?.content
}
};
}
}